FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
51 stars 19 forks source link

Getting 0 reads processed #74

Closed yonniejon closed 1 year ago

yonniejon commented 1 year ago

These are the commands I am running:

tools/star/STAR-2.7.10b/source/STAR --runThreadN 16 --runMode genomeGenerate --genomeDir GRCm38_index --genomeFastaFiles GRCm38_reference/GRCm38.primary_assembly.genome.fa --sjdbGTFfile GRCm38_reference/gencode.vM23.primary_assembly.annotation.gtf --sjdbOverhang 9

tools/SNPsplit-0.5.0/SNPsplit_genome_preparation --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf.gz --strain CAST_EiJ --strain2 FVB_NJ --reference_genome GRCm38_reference

tools/star/STAR-2.7.10b/source/STAR --genomeDir GRCm38_index/ --runThreadN 16 --readFilesIn $r1_file $r2_file --readFilesCommand zcat --outFileNamePrefix bams_for_snp_split/$name_prefix --alignEndsType EndToEnd --outSAMattributes NH HI NM MD --outSAMtype BAM Unsorted SortedByCoordinate

tools/SNPsplit-0.5.0/SNPsplit --snp_file GRCm38_reference/all_FVB_NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt $f --paired

The bam files are not empty and I have manually investigated various heterozygous sites in the sample using IGV. But the bam files generated by SNPsplit are empty and the SNPsplit report looks like:

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:       0
    thereof were read pairs:            0
    thereof were singletons:            0
Reads were unassignable (not overlapping SNPs):     0 (N/A%)
    thereof were read pairs:    0
    thereof were singletons:    0
Reads were specific for genome 1:           0 (N/A%)
    thereof were read pairs:    0
    thereof were singletons:    0
Reads were specific for genome 2:           0 (N/A%)
    thereof were read pairs:    0
    thereof were singletons:    0
Reads contained conflicting SNP information:        0 (N/A%)
    thereof were read pairs:    0
    thereof were singletons:    0

The output from SNPsplit itself looks like:

Allele-tagging report
=====================
Processed 11872436 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
0 reads were unassignable (0.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads did not contain one of the expected bases at known SNP positions (0.00%)
0 contained conflicting allele-specific SNPs (0.00%)

SNP coverage report
===================
SNP annotation file:    GRCm38_reference/all_FVB_NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
SNPs stored in total:   20581027
N-containing reads: 0
non-N:          0
total:          11872436
Reads had a deletion of the N-masked position (and were thus called Conflicting):   0 (0.00%)
Of which had multiple deletions of N-masked positions within the same read: 0 (0.00%)

Of valid N containing reads,
N was present in the list of known SNPs:    0 (N/A%)
N was not present in the list of SNPs:      0 (N/A%)

Finished allele-tagging for file <<< 'CastxFVB-20200710-COxCO-01_S40.splitreads.sortedByName.bam' >>>

Any idea why I am getting 0 reads processed? Not sure where to start

FelixKrueger commented 1 year ago

Happy New Year,

and apologies for the late reply. I think the the problem you are facing arises from the fact that your steps 1 and 2 need to be carried out in inverse order.

1) You need to run the SNPsplit genome preparation of the GRCm38 genome using the v5 VCF file. This will generate the N-masked version of the new dual hybrid genome (the folder name should be fairly telling).

2) You need to run STAR genomeGenerate process on the new N-masked sequences.

3) Align your FastQ files to the N-masked STAR index, and then run SNPsplit. This should then all work.

PS: If you are already using the GRCm39 genome version, you should be using the v8 annotations to go with the latest genome version. This will soon be the default.

yonniejon commented 1 year ago

Thanks for the reply!

Ah, I see. Thanks. I will try that.

I am currently using GRCm38 and not 39. Thank you!

Just to make sure I am understanding correctly - SNPsplit_genome_preparation creates an output CAST_EiJ_FVB_NJ_dual_hybrid.based_on_GRCm38_N-masked where inside there are fasta files for each chromosome. I should use those fasta files as the input to Star's genome generate command?

yonniejon commented 1 year ago

I did this and I still got 0 reads processed. This is part of the SNPsplit output:

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:       5484876
    thereof were read pairs:            5459714
    thereof were singletons:            25162
Reads were unassignable (not overlapping SNPs):     5484876 (100.00%)
    thereof were read pairs:    5459714
    thereof were singletons:    25162
Reads were specific for genome 1:           0 (0.00%)
    thereof were read pairs:    0
    thereof were singletons:    0
Reads were specific for genome 2:           0 (0.00%)
    thereof were read pairs:    0
    thereof were singletons:    0
Reads contained conflicting SNP information:        0 (0.00%)
    thereof were read pairs:    0
    thereof were singletons:    0

Sorting finished successfully

Appending sorting report to tagging report...

SNPsplit processing finished... [Allele-tagging + tag2sort]

I think the problem might be in the SNPsplit genome prepare command?

For both the CAST and the FVB strain the genome_preparation_report.txt looks like:

cat CAST_EiJ_genome_preparation_report.txt                            

Summary
0 Ns were newly introduced into the N-masked genome for strain CAST_EiJ in total
0 SNPs were newly introduced into the full sequence genome version for strain CAST_EiJ in total
Summary
0 Ns were newly introduced into the N-masked genome for strain 2 [FVB_NJ] in total
0 SNPs were newly introduced into the full sequence genome version for strain 2 [FVB_NJ] in total
FelixKrueger commented 1 year ago

Yes, that would certainly be a reason.... Which eventually leads to:

SNP coverage report
===================
SNP annotation file:    GRCm38_reference/all_FVB_NJ_SNPs_CAST_EiJ_reference.based_on_GRCm38.txt
SNPs stored in total:   20581027
N-containing reads: 0
total:          11872436

Can you link the commands you used, and maybe attach the log files? Usually when this happens people do not use the Ensembl genome but an NCBI/UCSC genome (where chromosome names are different). Is there a chance this could be true in your case?

yonniejon commented 1 year ago

Yes. that is the case. I used the genome from gencode here https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/GRCm38.primary_assembly.genome.fa.gz

But I thought it was ok because I downloaded the vcf file from the mouse genomes project linked in the documentation of SNPsplit here and then I added the "chr" prefix to all values in the chromosome column. But I will retry everything with an Ensembl genome. and the original VCF file.

FelixKrueger commented 1 year ago

good luck! Let me know how you get on.

yonniejon commented 1 year ago

It worked! Thank you very much and apologies for the naive questions/issues.

FelixKrueger commented 1 year ago

Excellent!

X201231 commented 4 months ago

It worked! Thank you very much and apologies for the naive questions/issues.

Hi, I met the same problem. How did you solve it in the end? Which version of the genome did you ultimately choose? Thanks a lot!!

FelixKrueger commented 4 months ago

The steps are shown here: https://felixkrueger.github.io/SNPsplit/workflow/

(with a link to the genome to download under point 2)). I hope this helps?

X201231 commented 4 months ago

The steps are shown here: https://felixkrueger.github.io/SNPsplit/workflow/

(with a link to the genome to download under point 2)). I hope this helps?

It helps a lot, thank you. I changed the name of the chromosome in the reference genome to match the vcf file, and it worked.