FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
51 stars 19 forks source link

Question about genome preparation for C57BL/6J x JF1/Ms hybrid mouse #41

Closed pplaw closed 2 years ago

pplaw commented 3 years ago

Dear Felix,

My group is hoping to analyse some published data generated from a hybrid mouse train (C57BL/6J x JF1/Ms) in which SNP information for JF1/Ms is not available in the “mgp.v5.merged.indels.dbSNP142.normed.vcf.gz” vcf file from sanger institute website We have however managed to find the SNP info stored in a .tsv file for this strain in the following format:

Chromosome Start_position End_position Ref Alt
1 3000176 3000176 T G

Can I please check with you whether we can use this tsv file (or a reformatted version of this tsv file) for generation of the masked genome using the following command:

SNPsplit_genome_preparation --reference /ref_genome/mm10/ --strain JF1/Ms --skip_filtering

Many thanks!

Best, Pui

FelixKrueger commented 3 years ago

Hi Pui,

I theory you should be able use this information as to construct a genome, but you will need to adhere to the same rules that the script uses internally.

First of all, you cannot use / characters in names as it would probably do bad things. I would recommend an underscore _ instead, so maybe:

--strain JF1_Ms

Next, the SNPs will need to be contained within a folder called:

SNPs_JF1_Ms

inside there, the script expects one file per chromosome, e.g. like so:

chr10.txt
chr11.txt
chr12.txt
chr13.txt
chr14.txt
chr15.txt
chr16.txt
chr17.txt
chr18.txt
chr19.txt
chr1.txt
chr2.txt
chr3.txt
chr4.txt
chr5.txt
chr6.txt
chr7.txt
chr8.txt
chr9.txt
chrMT.txt
chrX.txt
chrY.txt

So you will have to split your tsv file by chromosome. And finally, these file need to adhere to the following format:

>10
42459235        10      3101362 1       G/C     1/1:101:31:0:284,101,0:255,90,0:2:51:31:0,0,22,9:0:-0.69311:.:1
42459276        10      3102112 1       T/C     1/1:127:50:0:288,164,0:255,151,0:2:44:50:0,0,22,28:0:-0.693147:.:1
42459298        10      3102661 1       C/G     1/1:127:59:0:290,192,0,.,.,.:255,178,0,.,.,.:2:48:59:0,0,34,25:0:-0.693147:.:1
42459325        10      3103479 1       A/G     1/1:127:49:0:288,161,0:255,148,0:2:57:49:0,0,17,32:0:-0.693147:.:1
...

where the first line is the name of the chromosome (like in a FastA file).

this is also explained in the option:

--skip_filtering              This option skips reading and filtering the VCF file. This assumes that a folder named
                              'SNPs_<Strain_Name>' exists in the working directory, and that text files with SNP information
                              are contained therein in the following format:
                                      SNP-ID     Chromosome  Position    Strand   Ref/SNP
                          example:   33941939        9       68878541       1       T/G

I believe column 6 is optional, but the first 5 need to be present. It is sufficient to put 1 as strand if the sequence is based relative to the top strand. So you will need to reformat your file just a little, and you should be able to run it. If you have no one to do this for you I might be able to help.

I hope this makes sense?

pplaw commented 3 years ago

Hi Felix

Thanks a lot for your advice! It does make sense to me. I have tried to reformat the tsv file that I have and can I please check with you whether txt files look like this will be fine? (e.g. for chr1.txt, in the order of: "SNP-ID", "Chromosome", "Pos", "Strand", "Ref/SNP")

image

Thanks

Best, Pui

FelixKrueger commented 3 years ago

Ref/SNP needs to be present without the double quotes, just

1        1      3000176    1        T/G
2        1      3000223    1        T/A
...

Otherwise it looks good to me!
pplaw commented 3 years ago

Hi Felix,

Thanks a lot for your reply. I have now created the folder "SNPs_JF1_Ms" with the txt files in my working directory. However, when I tried to run the following command,:

SNPsplit_genome_preparation --reference /ref_genome/mm10/ --strain JF1_Ms --skip_filtering

I have an error message returned: You need to provide a VCF file detailing SNPs positions with '--vcf_file your.file' (e.g.: --vcf mgp.v5.merged.snps_all.dbSNP142.vcf.gz). Please respecify!

Can I please check with you whether I need to enter something for the --vcf flag even if I am using --skip_filtering?

Many thanks

FelixKrueger commented 3 years ago

Hmm, can you just try to specify any file as --vcf any_file, it should be skipped then. Maybe the logic needs to be changed to that specifying a VCF file is not enforced if you are using --skip_filtering. Let me know if this works, else I'll take a look.

pplaw commented 3 years ago

Hi Felix,

Thanks a lot for your reply. I tried to specify different files or folders in my working directory to --vcf and in all cases I have the same message returned: image

FelixKrueger commented 3 years ago

OK, I'll try to make up a hardcoded version for you and will attach it here once I've got it.

FelixKrueger commented 3 years ago

Quick question, which version of SNPsplit are you using? While trying to adapt the genome preparation script I noticed that it just seems to work straight out the box of you will (version 0.4.0), but this is because --skip_filtering has already hardcoded the list of chromosomes.

Command:

SNPsplit_genome_preparation --reference /bi//scratch/Genomes/Mouse/GRCm38/ --strain JF1_Ms --skip_filtering
Reference genome folder provided is /bi//scratch/Genomes/Mouse/GRCm38/  (absolute path is '/bi/scratch/Genomes/Mouse/GRCm38/)'

Summarising SNPsplit Genome Preparation Parameters
==================================================
Reading/filtering VCF file:             No (skipped by user)
Reference genome:                       /bi/scratch/Genomes/Mouse/GRCm38/
N-masking:                              Yes
Full SNP genome:                        No
SNP strain:                             JF1_Ms

Using the following chromosomes (HARDCODED IN!!!):
1       2       3       4       5       6       7       8       9       10      11      12      13      14      15      16      17      18      19      X       Y       MT

Skipped reading the VCF file and filtering SNPs again (specified by user)

Now reading in and storing sequence information of the genome specified in: /bi/scratch/Genomes/Mouse/GRCm38/

chr 1 (195471971 bp)
chr 10 (130694993 bp)
...

This uses standard mouse chromosomes 1-9, chrMT, chrX and chrY.

can you just try cloning the current development version and try again?

git clone https://github.com/FelixKrueger/SNPsplit.git

pplaw commented 3 years ago

Hi Felix,

Thank you very much and sorry for not having the most up-to-date SNPsplit to start with (I was on 0.3.2). It is now working fine! Thank you very much!

FelixKrueger commented 3 years ago

Maybe this is also of interest for you, I have just added provisional support for the MGP v7 variants file, which appears to include the strains you are interested in. Please see the description here:

https://github.com/FelixKrueger/SNPsplit/blob/master/CHANGELOG.md#snpsplit_genome_preparation

FelixKrueger commented 2 years ago

As a final comment, I have conducted some sorting followed by IMPLICON processing for B6/JF1 mice, and it worked just fine!