trouble with BS-Seq - Githubissues

lucacozzuto commented 6 months ago

Hello, many thanks for this tool, it is very beneficial for our research! I was trying to use it for a BS-Seq experiment but for some reason, I always got no assignation of reads to the two parent genomes. I used Bismark for alignment and then SNPsplit. I got a nice result with RNAseq and STAR as the aligner but I don't know why I get no reads for BS-Seq.

So, I'm wondering if you can help me with this. I'm using this sample as test case: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR15829164&display=metadata

that is a mix of 129S1_SvImJ and CAST_EiJ.

Best and thanks in advance!

Luca

FelixKrueger commented 6 months ago

Hi Luca,

Did you follow the instructions to donwload the the MGP genome SNP file, genome preparation on the Ensembl genome, followed by alignments to that N-masked dual genome?

lucacozzuto commented 6 months ago

Hi. Is it different for the RNAseq one? I used the ensembl reference and downloaded the huge SNP file... I then used Bismarck for genome preparation using the SNP masked genome as explained

Luca

FelixKrueger commented 6 months ago

yes, that should be the one. Can you link the commands and files you used, and also the text you see on screen?

lucacozzuto commented 6 months ago

SNP split

SNPsplit_genome_preparation  --vcf_file mgp.v5.merged.snps_all.dbSNP142.vcf.gz --reference_genome ./genome/ --dual_hybrid --strain 129S1_SvImJ --strain2 CAST_EiJ

End of the log:

Summary
20563466 Ns were newly introduced into the N-masked genome for strain/strain 2 [129S1_SvImJ/CAST_EiJ] in total
20563466 SNPs were newly introduced into the full sequence genome version for strainstrain 2 [129S1_SvImJ/CAST_EiJ] in total

All done. Genome(s) are now ready to be indexed with your favourite aligner!
FYI, aligners shown to work with SNPsplit are Bowtie2, STAR, HISAT2, HiCUP and Bismark (STAR and Hisat2 require disabling soft-clipping, please check the SNPsplit manual for details)

chr 1 (195471971 bp)
chr 10 (130694993 bp)
chr 11 (122082543 bp)
chr 12 (120129022 bp)
chr 13 (120421639 bp)
chr 14 (124902244 bp)
chr 15 (104043685 bp)
chr 16 (98207768 bp)
chr 17 (94987271 bp)
chr 18 (90702639 bp)
chr 19 (61431566 bp)
chr 2 (182113224 bp)
chr 3 (160039680 bp)
chr 4 (156508116 bp)
chr 5 (151834684 bp)
chr 6 (149736546 bp)
chr 7 (145441459 bp)
chr 8 (129401213 bp)
chr 9 (124595110 bp)
chr MT (16299 bp)
chr X (171031299 bp)
chr Y (91744698 bp)

genome preparation

bismark_genome_preparation --single_fasta ./genome/

End of the log:

Renaming BS_CT.3.bt2.tmp to BS_CT.3.bt2
Renaming BS_CT.4.bt2.tmp to BS_CT.4.bt2
Renaming BS_CT.1.bt2.tmp to BS_CT.1.bt2
Renaming BS_CT.2.bt2.tmp to BS_CT.2.bt2
Renaming BS_CT.rev.1.bt2.tmp to BS_CT.rev.1.bt2
Renaming BS_CT.rev.2.bt2.tmp to BS_CT.rev.2.bt2

=========================================

Parallel genome indexing complete. Enjoy!

I tested 10 million reads from the sample SRR15829164

bismark --genome ./snpslitp_db/genome -1 ../test_rna/10M/test_10M_1.fq.gz -2 ../test_rna/10M/test_10M_2.fq.gz --bam --parallel 8

End of the log

Final Alignment report
======================
Sequence pairs analysed in total:   10000000
Number of paired-end alignments with a unique best hit: 5222027
Mapping efficiency: 52.2%

Sequence pairs with no alignments under any condition:  3793223
Sequence pairs did not map uniquely:    984750
Sequence pairs which were discarded because genomic sequence could not be extracted:    1

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:   2143850 ((converted) top strand)
GA/CT/CT:   0   (complementary to (converted) top strand)
GA/CT/GA:   0   (complementary to (converted) bottom strand)
CT/GA/GA:   3078176 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   300753094

Total methylated C's in CpG context:    29986455
Total methylated C's in CHG context:    66354948
Total methylated C's in CHH context:    203380109
Total methylated C's in Unknown context:    93185

Total unmethylated C's in CpG context:  198674
Total unmethylated C's in CHG context:  183672
Total unmethylated C's in CHH context:  649236
Total unmethylated C's in Unknown context:  24915

C methylated in CpG context:    99.3%
C methylated in CHG context:    99.7%
C methylated in CHH context:    99.7%
C methylated in Unknown context (CN or CHN):    78.9%

Deleting temporary report files...
test_10M_1.fq.gz.temp.1_bismark_bt2_PE_report.txt   test_10M_1.fq.gz.temp.2_bismark_bt2_PE_report.txt   test_10M_1.fq.gz.temp.3_bismark_bt2_
PE_report.txt   test_10M_1.fq.gz.temp.4_bismark_bt2_PE_report.txt   test_10M_1.fq.gz.temp.5_bismark_bt2_PE_report.txt   test_10M_1.fq.gz.tem
p.6_bismark_bt2_PE_report.txt   test_10M_1.fq.gz.temp.7_bismark_bt2_PE_report.txt   test_10M_1.fq.gz.temp.8_bismark_bt2_PE_report.txt   

Bismark completed in 0d 0h 26m 31s

====================
Bismark run complete
====================

SNPsplit

SNPsplit --paired --snp_file all_CAST_EiJ_SNPs_129S1_SvImJ_reference.based_on_GRCm39.txt.gz --bisulfite -o ./ test_10M_1_bismark_bt2_pe.bam

Log


Output will be written into the directory: ./anno/
Testing if input file 'test_10M_1_bismark_bt2_pe.bam' looks like a Bisulfite-Seq file
File looks like a Bismark paired-end file. Setting '--bisulfite' and '--paired'...

Now testing Bismark result file test_10M_1_bismark_bt2_pe.bam for positional sorting
File appears to be in the right format (Read1 and Read2 following each other), setting '--no_sort'...
Now starting to process file <<< 'test_10M_1_bismark_bt2_pe.bam' >>> 

Input file:                 'test_10M_1_bismark_bt2_pe.bam'
Writing SNPplit-tag report to:          'test_10M_1_bismark_bt2_pe.SNPsplit_report.txt'
Writing allele-flagged output file to:      'test_10M_1_bismark_bt2_pe.allele_flagged.bam'

Summary of parameters for SNPsplit-tag:
========================================
SNPsplit infile:        test_10M_1_bismark_bt2_pe.bam
SNP annotation file:        all_CAST_EiJ_SNPs_129S1_SvImJ_reference.based_on_GRCm39.txt.gz
Output directory:       >/nfs/software/bi/biocore_tools/git/nextflow_dls2/new/AS_analysis/anno/<
Parent directory:       >/nfs/software/bi/biocore_tools/git/nextflow_dls2/new/AS_analysis/anno<
Samtools path:          /usr/local/bin/samtools
Output format:          BAM (default)
Input format:           Bismark paired-end (not relevant for tagging process)

File was specified to be sorted (i.e. Read 1 and Read 2 are following each other in the paired-end BAM file), thus skipping the sorting step

Storing SNP positions provided in 'all_CAST_EiJ_SNPs_129S1_SvImJ_reference.based_on_GRCm39.txt.gz'
Stored 6523 positions in total

Reading from sorted mapping file 'test_10M_1_bismark_bt2_pe.bam'
Processed 1000000 lines so far
Processed 2000000 lines so far
Processed 3000000 lines so far
Processed 4000000 lines so far
Processed 5000000 lines so far
Processed 6000000 lines so far
Processed 7000000 lines so far
Processed 8000000 lines so far
Processed 9000000 lines so far
Processed 10000000 lines so far

Allele-tagging report
=====================
Processed 10444052 read alignments in total
Reads were unaligned and hence skipped: 0 (0.00%)
Reads were hard-clipped (CIGAR: H) and skipped: 0
10444051 reads were unassignable (100.00%)
0 reads were specific for genome 1 (0.00%)
0 reads were specific for genome 2 (0.00%)
0 reads that were unassignable contained C>T SNPs preventing the assignment
15 reads did not contain one of the expected bases at known SNP positions (0.00%)
1 contained conflicting allele-specific SNPs (0.00%)

SNP coverage report
===================
SNP annotation file:    all_CAST_EiJ_SNPs_129S1_SvImJ_reference.based_on_GRCm39.txt.gz
SNPs stored in total:   6523
N-containing reads: 15
non-N:          10444036
total:          10444052
Reads had a deletion of the N-masked position (and were thus called Conflicting):   1 (0.00%)
Of which had multiple deletions of N-masked positions within the same read: 0 (0.00%)

Of valid N containing reads,
N was present in the list of known SNPs:    0 (0.00%)
Positions were skipped since they involved C>T SNPs:    0
N was not present in the list of SNPs:      60 (100.00%)

Finished allele-tagging for file <<< 'test_10M_1_bismark_bt2_pe.bam' >>> 

Summary of parameters for SNPsplit-sort:
========================================
SNPsplit tagged infile:     test_10M_1_bismark_bt2_pe.allele_flagged.bam
Output directory:       >/nfs/software/bi/biocore_tools/git/nextflow_dls2/new/AS_analysis/anno/<
Parent directory:       >/nfs/software/bi/biocore_tools/git/nextflow_dls2/new/AS_analysis/anno<
Samtools path:          /usr/local/bin/samtools
Output format:          BAM (default)
Input format:           Paired-end

Writing YAML report to test_10M_1_bismark_bt2_pe.SNPsplit_sort.yaml

Input file:                     'test_10M_1_bismark_bt2_pe.allele_flagged.bam'
Writing SNPsplit-sort report to:            'test_10M_1_bismark_bt2_pe.SNPsplit_sort.txt'
Writing unassigned reads to:                'test_10M_1_bismark_bt2_pe.unassigned.bam'
Writing genome 1-specific reads to:         'test_10M_1_bismark_bt2_pe.genome1.bam'
Writing genome 2-specific reads to:         'test_10M_1_bismark_bt2_pe.genome2.bam'
Processed 1000000 lines so far
Processed 2000000 lines so far
Processed 3000000 lines so far
Processed 4000000 lines so far
Processed 5000000 lines so far
Last read was a read pair which has already been processed. all done

Allele-specific paired-end sorting report
=========================================
Read pairs/singletons processed in total:       5222026
    thereof were read pairs:            5222026
    thereof were singletons:            0
Reads were unassignable (not overlapping SNPs):     5222025 (100.00%)
    thereof were read pairs:    5222025
    thereof were singletons:    0
Reads were specific for genome 1:           0 (0.00%)
    thereof were read pairs:    0
    thereof were singletons:    0
Reads were specific for genome 2:           0 (0.00%)
    thereof were read pairs:    0
    thereof were singletons:    0
Reads contained conflicting SNP information:        1 (0.00%)
    thereof were read pairs:    1
    thereof were singletons:    0

Sorting finished successfully

Appending sorting report to tagging report...

SNPsplit processing finished... [Allele-tagging + tag2sort]
`
``

FelixKrueger commented 6 months ago

From a technical perspective all programs seem to have run successfully. I noticed that you used the outdated v5 of the mouse genomes project genome (see this legacy (v5/GRCm38) page.

The current version SNPsplit is designed for the GRCm39 genome build, using v8 annotations: https://felixkrueger.github.io/SNPsplit/genome_prep/genome_preparation/.

Is there a change you used the v5 annotations in combination with the latest genome?

I would definitely recommend you download the v8 annotation file and give it another try.

lucacozzuto commented 6 months ago

Hi, I'm using the version GRCm38.68 to be consistent with our previous analysis... As I said testing RNAseq samples was ok, but the bisulfite part is not working... just wondering what is going on. Btw, do you have or can you recommend a test dataset I could use for checking wether I'm doing something wrong?

Luca

FelixKrueger commented 6 months ago

Hmm, it seems to have to with the fact that your SNP list is so small:

SNP coverage report
===================
SNP annotation file:    all_CAST_EiJ_SNPs_129S1_SvImJ_reference.based_on_GRCm39.txt.gz
SNPs stored in total:   6523

There should be many millions of SNPs... Maybe you could try and find an older version of SNPsplit and try that? Could you also send me few sample reads of your data and try it out over here? say 200K raw reads should be fine. Cheers, Felix

lucacozzuto commented 6 months ago

Ouch... I think the huge SNP file was damaged... thanks for pointing this out, I'll re-download and check it

lucacozzuto commented 6 months ago

It is working. We can close this. Thanks again!

FelixKrueger / SNPsplit

trouble with BS-Seq #80