Closed pplaw closed 3 years ago
Hi Pui,
I theory you should be able use this information as to construct a genome, but you will need to adhere to the same rules that the script uses internally.
First of all, you cannot use /
characters in names as it would probably do bad things. I would recommend an underscore _
instead, so maybe:
--strain JF1_Ms
Next, the SNPs will need to be contained within a folder called:
SNPs_JF1_Ms
inside there, the script expects one file per chromosome, e.g. like so:
chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrMT.txt
chrX.txt
chrY.txt
So you will have to split your tsv file by chromosome. And finally, these file need to adhere to the following format:
>10
42459235 10 3101362 1 G/C 1/1:101:31:0:284,101,0:255,90,0:2:51:31:0,0,22,9:0:-0.69311:.:1
42459276 10 3102112 1 T/C 1/1:127:50:0:288,164,0:255,151,0:2:44:50:0,0,22,28:0:-0.693147:.:1
42459298 10 3102661 1 C/G 1/1:127:59:0:290,192,0,.,.,.:255,178,0,.,.,.:2:48:59:0,0,34,25:0:-0.693147:.:1
42459325 10 3103479 1 A/G 1/1:127:49:0:288,161,0:255,148,0:2:57:49:0,0,17,32:0:-0.693147:.:1
...
where the first line is the name of the chromosome (like in a FastA file).
this is also explained in the option:
--skip_filtering This option skips reading and filtering the VCF file. This assumes that a folder named
'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
are contained therein in the following format:
SNP-ID Chromosome Position Strand Ref/SNP
example: 33941939 9 68878541 1 T/G
I believe column 6 is optional, but the first 5 need to be present. It is sufficient to put 1 as strand if the sequence is based relative to the top strand. So you will need to reformat your file just a little, and you should be able to run it. If you have no one to do this for you I might be able to help.
I hope this makes sense?
Hi Felix
Thanks a lot for your advice! It does make sense to me. I have tried to reformat the tsv file that I have and can I please check with you whether txt files look like this will be fine? (e.g. for chr1.txt, in the order of: "SNP-ID", "Chromosome", "Pos", "Strand", "Ref/SNP")
Thanks
Best, Pui
Ref/SNP needs to be present without the double quotes, just
1 1 3000176 1 T/G
2 1 3000223 1 T/A
...
Otherwise it looks good to me!
Hi Felix,
Thanks a lot for your reply. I have now created the folder "SNPs_JF1_Ms" with the txt files in my working directory. However, when I tried to run the following command,:
SNPsplit_genome_preparation --reference /ref_genome/mm10/ --strain JF1_Ms --skip_filtering
I have an error message returned: You need to provide a VCF file detailing SNPs positions with '--vcf_file your.file' (e.g.: --vcf mgp.v5.merged.snps_all.dbSNP142.vcf.gz). Please respecify!
Can I please check with you whether I need to enter something for the --vcf flag even if I am using --skip_filtering?
Many thanks
Hmm, can you just try to specify any file as --vcf any_file
, it should be skipped then. Maybe the logic needs to be changed to that specifying a VCF file is not enforced if you are using --skip_filtering
. Let me know if this works, else I'll take a look.
Hi Felix,
Thanks a lot for your reply. I tried to specify different files or folders in my working directory to --vcf and in all cases I have the same message returned:
OK, I'll try to make up a hardcoded version for you and will attach it here once I've got it.
Quick question, which version of SNPsplit are you using? While trying to adapt the genome preparation script I noticed that it just seems to work straight out the box of you will (version 0.4.0), but this is because --skip_filtering
has already hardcoded the list of chromosomes.
Command:
SNPsplit_genome_preparation --reference /bi//scratch/Genomes/Mouse/GRCm38/ --strain JF1_Ms --skip_filtering
Reference genome folder provided is /bi//scratch/Genomes/Mouse/GRCm38/ (absolute path is '/bi/scratch/Genomes/Mouse/GRCm38/)'
Summarising SNPsplit Genome Preparation Parameters
==================================================
Reading/filtering VCF file: No (skipped by user)
Reference genome: /bi/scratch/Genomes/Mouse/GRCm38/
N-masking: Yes
Full SNP genome: No
SNP strain: JF1_Ms
Using the following chromosomes (HARDCODED IN!!!):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y MT
Skipped reading the VCF file and filtering SNPs again (specified by user)
Now reading in and storing sequence information of the genome specified in: /bi/scratch/Genomes/Mouse/GRCm38/
chr 1 (195471971 bp)
chr 10 (130694993 bp)
...
This uses standard mouse chromosomes 1-9, chrMT, chrX and chrY.
can you just try cloning the current development version and try again?
git clone https://github.com/FelixKrueger/SNPsplit.git
Hi Felix,
Thank you very much and sorry for not having the most up-to-date SNPsplit to start with (I was on 0.3.2). It is now working fine! Thank you very much!
Maybe this is also of interest for you, I have just added provisional support for the MGP v7 variants file, which appears to include the strains you are interested in. Please see the description here:
https://github.com/FelixKrueger/SNPsplit/blob/master/CHANGELOG.md#snpsplit_genome_preparation
As a final comment, I have conducted some sorting followed by IMPLICON processing for B6/JF1 mice, and it worked just fine!
Dear Felix,
My group is hoping to analyse some published data generated from a hybrid mouse train (C57BL/6J x JF1/Ms) in which SNP information for JF1/Ms is not available in the “mgp.v5.merged.indels.dbSNP142.normed.vcf.gz” vcf file from sanger institute website We have however managed to find the SNP info stored in a .tsv file for this strain in the following format:
Can I please check with you whether we can use this tsv file (or a reformatted version of this tsv file) for generation of the masked genome using the following command:
SNPsplit_genome_preparation --reference /ref_genome/mm10/ --strain JF1/Ms --skip_filtering
Many thanks!
Best, Pui