FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.1.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b #648

Closed wangguiqian closed 5 months ago

wangguiqian commented 6 months ago

This my command : bismark --genome WholeGenomeFasta/ -1 CRR91_f1.fq.gz -2 CRR91_r2.fq.gz --path_to_hisat2 /public/home/software/opt/bio/software/hisat2/2.2.1/ - -samtools_path /public/home/software/opt/bio/software/SAMtools/1.9/bin/ --bam --hisat2 --prefix CRR91 -o Bismark_CRR91

HISAT2 seems to be working fine (tested command '/public/home/software/opt/bio/software/hisat2/2.2.1/hisat2 --version' [2.2.1]) Output format is BAM (default) Alignments will be written out in BAM format. Samtools path provided as: '/public/home/software/opt/bio/software/SAMtools/1.9/bin/samtools' Reference genome folder provided is WholeGenomeFasta/ (absolute path is '/public/home/yqming/WGBS/WholeGenomeFasta/)' The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.1.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.2.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.3.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.4.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.5.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.6.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.7.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the C->T converted genome seems to be faulty (BS_CT.8.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.1.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.2.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.3.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.4.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.5.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.6.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.7.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b The HISAT2 index of the G->A converted genome seems to be faulty (BS_GA.8.ht2 doesn't exist). Please run bismark_genome_preparation --hisat2 b

Couldn't find a traditional small HISAT2 index for the genome specified (ending in .ht2). Now searching for a large index instead (64-bit inde 64-bit large HISAT2 genome index found... FastQ format assumed (by default)

Input files to be analysed (in current folder '/public/home/yqming/WGBS'): CRR91_f1.fq.gz CRR91_r2.fq.gz Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be igno Created output directory Bismark_CRR91/!

Output will be written into the directory: /public/home/yqming/WGBS/Bismark_CRR91/ Using the following prefix for output files: CRR91

Setting parallelization to single-threaded (default)

Summary of all aligner options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --no-softclip --omit-sec-seq Current working directory is: /public/home/yqming/WGBS

Now reading in and storing sequence information of the genome specified in: /public/home/yqming/WGBS/WholeGenomeFasta/

FelixKrueger commented 6 months ago

Good morning, just for the record: have you run bismark_genome_preparation --hisat2 WholeGenomeFasta/ to generate the indexes?

wangguiqian commented 6 months ago

yes ,I ues comand : bismark_genome_preparation --large-index --hisat2 --path_to_aligne /public/home/software/opt/bio/software/hisat2/2.2.1/ WholeGenomeFasta/

FelixKrueger commented 6 months ago

Hmm, I wonder if the --large-index is causing the error your are seeing above. Does the genome you have necessitate the generation of a large index?

wangguiqian commented 6 months ago

yes,I first time useing comand : bismark_genome_preparation --hisat2 --path_to_aligne /public/home/software/opt/bio/software/hisat2/2.2.1/ then get result is : my genome is very large ,so suggests useing "--large-index"

wangguiqian commented 6 months ago

Please tell me ,Can I change the "BS_CT..ht2l" to "BS_CT..ht2";then I use above "bismark code" ?

FelixKrueger commented 6 months ago

Oh, hold on, I missed this line: 64-bit large HISAT2 genome index found...

Could it just be that everything is working just fine and, the alignments are running?

FelixKrueger commented 6 months ago

Please tell me ,Can I change the "BSCT..ht2l" to "BSCT..ht2";then I use above "bismark code" ?

That will certainly not work, but as pointed out I believe it might just be running all fine. Can you run top to see?

wangguiqian commented 6 months ago

Oh, hold on, I missed this line: 64-bit large HISAT2 genome index found...

Could it just be that everything is working just fine and, the alignments are running? No , I waite for the run to finish and found that no anything result was generated .

FelixKrueger commented 6 months ago

Can you elaborate what is happening? The index generation appears to have succeeded, and then Bismark only gets to the stage

Now reading in and storing sequence information of the genome specified in: /public/home/yqming/WGBS/WholeGenomeFasta/

? When you say you waited for the run to finish, did it generate an alignment report? Is there a chance the process got killed (e.g. by the scheduler)?

wangguiqian commented 6 months ago

I run the code ,then geting result: CRR91.CRR91_f1_bismark_hisat2_pe.bam CRR91.CRR91_f1_bismark_hisat2_PE_report.txt

wangguiqian commented 6 months ago

2924431657.png.zip

This the photo ,I fell the result file is very small .I guess geting the bad result.

wangguiqian commented 6 months ago

Final Alignment report

Sequence pairs analysed in total: 1243930319 Number of paired-end alignments with a unique best hit: 0 Sequence pairs with no alignments under any condition: 0 Sequence pairs did not map uniquely: 0 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 0

Total methylated C's in CpG context: 0 Total methylated C's in CHG context: 0 Total methylated C's in CHH context: 0 Total methylated C's in Unknown context: 0

wangguiqian commented 6 months ago

If I need retrying to run "bismark "

FelixKrueger commented 6 months ago

Oh wow, that is a LOT of reads!

Since you didn't get a single (!) alignment my guess is that the alignment processes got killed due to running out of memory, but this is just a guess.

Which kind of machine are you running this on, and did you see and errors in the log? I would suggest subsetting the reads (after adapter trimming, did you trim with e.g. Trim Galore?), and then trying again. such a test should complete in under a minute, and will tell you the system you are trying to run this on is capable of performing the task at all.

Here is an example:

gunzip CRR91_f1.fq.gz | head -40000 | gzip -c - > CRR91_f1_test_10K.fastq.gz
wangguiqian commented 6 months ago

I retry to use the command : bismark --genome WholeGenomeFasta/ -1 CRR91_f1_test_40K.fastq.gz -2 CRR91_r2_test_40K.fastq.gz --path_to_hisat2 /public/home/software/opt/bio/software/hisat2/2.2.1/ --samtools_path /public/home/software/opt/bio/software/SAMtools/1.9/bin/ --bam --hisat2 --prefix CRR91 -o Bismark_CRR91

But I retry to run that geting the result photo: 681186244.png.zip

Then I check the file "CRR91.CRR91_f1_test_40K_bismark_hisat2_PE_report.txt" Bismark was run with HISAT2 against the bisulfite genome of /public/home/yqming/WGBS/WholeGenomeFasta/ with the specified options: -q --score- Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)

Final Alignment report

Sequence pairs analysed in total: 10000 Number of paired-end alignments with a unique best hit: 0 Mapping efficiency: 0.0% Sequence pairs with no alignments under any condition: 10000 Sequence pairs did not map uniquely: 0 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 0 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 0 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 0

Total methylated C's in CpG context: 0 Total methylated C's in CHG context: 0 Total methylated C's in CHH context: 0 Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 0 Total unmethylated C's in CHG context: 0 Total unmethylated C's in CHH context: 0 Total unmethylated C's in Unknown context: 0

Can't determine percentage of methylated Cs in CpG context if value was 0 Can't determine percentage of methylated Cs in CHG context if value was 0 Can't determine percentage of methylated Cs in CHH context if value was 0 Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0

Bismark completed in 0d 0h 10m 21s

FelixKrueger commented 6 months ago

This is very odd indeed, can you send me the FastQC report of these files and/or the sequences files themselves to take a look?

FelixKrueger commented 6 months ago

One does not simply.... get 0 alignments for anything really.

This got me thinking, maybe there are some inhibitory sequences present, e.g. UMIs? Or maybe the paired-end files do not match up for some reason? Could you try to run just Read 1 in single-end mode, maybe with. --non_directional? Which type of sequencing library are you using?

As another thought, you could try and use Bowtie2 as the underlying aligner? Is there a specific reason why you are using HISAT2?

wangguiqian commented 6 months ago

ok,can you give me you email ,I send file by email

FelixKrueger commented 6 months ago

If you click on my avatar, you should be able to see it. Could you also mention which genome you are using? Thanks

wangguiqian commented 6 months ago

I just useing github to send my fsatQC report and file

wangguiqian commented 6 months ago

Can help me to check with you email ,just I sened.Thanks your help

FelixKrueger commented 6 months ago

I took a quick at the reads, here are a few things I noticed:

trim_galore --paired --clip_r1 10 --clip_r2 15 --threeprime_clip_r1 10 --three_prime_clip_r2 10 file_R1.fastq.gz file1_R2.fastq.gz

See more here: See more here: https://felixkrueger.github.io/Bismark/bismark/library_types/

My recommendation would be to use Trim Galore using the command above, and then run it again (maybe with the test dataset to start with). I am sure this will sort it out.

wangguiqian commented 5 months ago

Thanks for your help, but I followed your suggestion and used the following commands: trim_galore --paired --clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10 --illumina CRR91_f1.fq.gz CRR91_r2.fq.gz -o Trim_CRR91 But when it comes to the bismark processing step, the same result still appears. I feel really strange. I followed your suggestion two days ago. Is my data caused by the fact that the ID numbers of f1 reads and r2 reads in paired-end sequencing do not match? However, filtering usually results in incomplete matching of paired-end reads. What is the reason for this problem?

image

The photo is the bismark to run :

image

Bismark report for: CRR793891_f1_val_1.fq.gz and CRR793891_r2_val_2.fq.gz (version: v0.23.1) Bismark was run with HISAT2 against the bisulfite genome of /public/home/yqming/WGBS/WholeGenomeFasta/ with the specified options: -q --score- Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)

Final Alignment report

Sequence pairs analysed in total: 557310521 Number of paired-end alignments with a unique best hit: 0 Mapping efficiency: 0.0% Sequence pairs with no alignments under any condition: 557310521 Sequence pairs did not map uniquely: 0 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 0 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 0 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 0

Total methylated C's in CpG context: 0 Total methylated C's in CHG context: 0 Total methylated C's in CHH context: 0 Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 0 Total unmethylated C's in CHG context: 0 Total unmethylated C's in CHH context: 0 Total unmethylated C's in Unknown context: 0

Can't determine percentage of methylated Cs in CpG context if value was 0 Can't determine percentage of methylated Cs in CHG context if value was 0 Can't determine percentage of methylated Cs in CHH context if value was 0 Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0

FelixKrueger commented 5 months ago

I think there are currently two issues that need addressing. Firstly, you don't know whether your file(s) align to the genome at all, and secondly, there might be an issue with the paired-end nature of your files. I would to the following:

  1. As a test for data with the genome you are using, I would take just the a few lines of a trimmed file, and run the alignments in single-end (SE) mode:
gunzip -c CRR793891_f1_val_1.fq.gz | head -40000 | gzip -c - CRR793891_10Ktest_R1.fq.gz

bismark --genome /public/home/yqming/WGBS/WholeGenomeFasta/ CRR793891_10Ktest_R1.fq.gz

This should yield a good mapping efficiency and results.

  1. Q: " Is my data caused by the fact that the ID numbers of f1 reads and r2 reads in paired-end sequencing do not match?"

The read IDs don't have to match, but the order of the sequences have to be in sync. Both R1 and R2 files need to have the exactly same number of reads, and be in the correct order (and orientation). As a quick test you could look at the first and sequences of your file, but this obviously doesn't mean that the files are correct everywhere. Still, it's a (relatively) quick thing to check:

gunzip -c CRR793891_f1_val_1.fq.gz | head -8
gunzip -c CRR793891_f1_val_1.fq.gz | tail -8

gunzip -c CRR793891_r2_val_2.fq.gz | head -8
gunzip -c CRR793891_r2_val_2.fq.gz | tail -8

If you find that the reads are not in-sync you need to go back to the very raw FastQ files (make sure they are correct), and use those with Trim Galore. Trim Galore validates read pairs, so will never remove files from one file but not the other.

wangguiqian commented 5 months ago

Can I send my fastq files to you? I check the ID number of f1 reads and r2 reads in paired-end sequencing is matched.

Thanks for you help me.

wangguiqian commented 5 months ago
image

Output format is BAM (default) Alignments will be written out in BAM format. Samtools found here: '/public/home/software/opt/bio/software/SAMtools/1.9/bin/samtools' Reference genome folder provided is /public/home/yqming/WGBS/WholeGenomeFasta/ (absolute path is '/public/home/yqming/WGBS/WholeGenomeFasta/) ' The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.1.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.2.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.3.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.4.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.rev.1.bt2'). Please run the bismark_genome_preparat ion before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.rev.2.bt2'). Please run the bismark_genome_preparat ion before running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.1.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.2.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.3.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.4.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.rev.1.bt2'). Please run bismark_genome_preparation before running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.rev.2.bt2'). Please run bismark_genome_preparation before running Bismark

Couldn't find a traditional small Bowtie 2 index for the genome specified (ending in .bt2). Now searching for a large index instead (64-bit in dex ending in .bt2l)... The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.1.bt2l'). Please run the bismark_genome_preparation before running Bismark

wangguiqian commented 5 months ago
image

The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.1.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.2.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.3.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.4.bt2'). Please run the bismark_genome_preparation before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.rev.1.bt2'). Please run the bismark_genome_preparat ion before running Bismark The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.rev.2.bt2'). Please run the bismark_genome_preparat ion before running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.1.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.2.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.3.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.4.bt2'). Please run bismark_genome_preparation befo re running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.rev.1.bt2'). Please run bismark_genome_preparation before running Bismark The Bowtie 2 index of the G->A converted genome seems to be faulty or non-existant ('BS_GA.rev.2.bt2'). Please run bismark_genome_preparation before running Bismark

Couldn't find a traditional small Bowtie 2 index for the genome specified (ending in .bt2). Now searching for a large index instead (64-bit in dex ending in .bt2l)... The Bowtie 2 index of the C->T converted genome seems to be faulty or non-existant ('BS_CT.1.bt2l'). Please run the bismark_genome_preparation before running Bismark

wangguiqian commented 5 months ago

I useing command : bismark --genome ../WholeGenomeFasta/ CRR793891_10Ktest_R1 .fq.gz --path_to_hisat2 /public/home/software/opt/bio/software/hisat2/2.2.1/ --samtools_path /public/home/software/opt/bio/software/SAMtools/1 .9/bin/ --bam --hisat2 --prefix CRR91_10k -o Bismark_CRR91_10k

I want to use SE model for my CRR793891_10Ktest_R1.fq.gz files.

FelixKrueger commented 5 months ago

Is there a chance the (duplicated) error messages above are not from the last command? I find it hard to believe that it complains about not finding a Bowtie2 index but you've got --hisat2 in your command....

And yes, you can send the untrimmed, small subset FastQ files.

wangguiqian commented 5 months ago

I send you by email. help me to check.

wangguiqian commented 5 months ago

Is there a chance the (duplicated) error messages above are not from the last command? I find it hard to believe that it complains about not finding a Bowtie2 index but you've got --hisat2 in your command....

And yes, you can send the untrimmed, small subset FastQ files.

I understand I useing "--hisat2"

wangguiqian commented 5 months ago

I useing command: "bismark --genome ../WholeGenomeFasta/ CRR793891_10Ktest_R1.fq.gz --path_to_hisat2 /public/home/software/opt/bio/software/hisat2/2.2.1/ --samtools_path /public/home/software/opt/bio/software/SAMtools/1.9/bin/ --bam --hisat2 --prefix CRR91_10k -o Bismark_CRR91_10k"

But I get result photo:

image

Bismark report for: CRR793891_10Ktest_R1.fq.gz (version: v0.23.1)
Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed) Bismark was run with HISAT2 against the bisulfite genome of /public/home/yqming/WGBS/WholeGenomeFasta/ with the specified options: -q --score-

Final Alignment report

Sequences analysed in total: 10000 Number of alignments with a unique best hit from the different alignments: 0 Mapping efficiency: 0.0% Sequences with no alignments under any condition: 10000 Sequences did not map uniquely: 0 Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT: 0 ((converted) top strand) CT/GA: 0 ((converted) bottom strand) GA/CT: 0 (complementary to (converted) top strand) GA/GA: 0 (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 0

Total methylated C's in CpG context: 0 Total methylated C's in CHG context: 0 Total methylated C's in CHH context: 0 Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 0 Total unmethylated C's in CHG context: 0 Total unmethylated C's in CHH context: 0 Total unmethylated C's in Unknown context: 0

Can't determine percentage of methylated Cs in CpG context if value was 0 Can't determine percentage of methylated Cs in CHG context if value was 0 Can't determine percentage of methylated Cs in CHH context if value was 0 Can't determine percentage of methylated Cs in Unknown context (CN or CHN) if value was 0

Bismark completed in 0d 0h 6m 50s

FelixKrueger commented 5 months ago

OK good, this completes fine, albeit without alignments. Which genome are you using?

wangguiqian commented 5 months ago

I geneme file is very large. I useing genome name :ibetan wild tree peony genome (Paeonia ludlowii)

wangguiqian commented 5 months ago

I have checked the reads of f1 and r2 of my fastq file. Their ID numbers can basically correspond completely, but the reads processed by quality control only have a small number of ID numbers that cannot correspond. So I feel very strange about this question. This has never happened to me when I used it before.

wangguiqian commented 5 months ago

But there was no problem at all when I used Hisat2 alone for alignment. I also used the original genome and original reads for alignment. So I hope to get your help.

wangguiqian commented 5 months ago

I useing command : hisat2 -p 5 -x WholeGenomeFasta/Bisulfite_Genome/CT_conversion/BS_CT -1 CRR91_f1_test_40K.fastq.gz -2 CRR91_r2_test_40K.fastq.gz -S test_40k.sam

image
FelixKrueger commented 5 months ago

processed by quality control only have a small number of ID numbers that cannot correspond.

Not sure I understand this sentence, can you please clarify?

Have you tried the single-end aligment with Bismark?

I am in the process of carrying out some tests here, but I need to download and index the genome first....

wangguiqian commented 5 months ago

processed by quality control only have a small number of ID numbers that cannot correspond.

Not sure I understand this sentence, can you please clarify?

Have you tried the single-end aligment with Bismark?

I am in the process of carrying out some tests here, but I need to download and index the genome first....

Yes ,I uesing sing-end reads to aligment ,the my command : bismark --genome ../WholeGenomeFasta/ CRR793891_10Ktest_R1.fq.gz --path_to_hisat2 /public/home/software/opt/bio/software/hisat2/2.2.1/ --samtools_path /public/home/software/opt/bio/software/SAMtools/1.9/bin/ --bam --hisat2 --prefix CRR91_10k -o Bismark_CRR91_10k

FelixKrueger commented 5 months ago

and, does that work?

wangguiqian commented 5 months ago

and, does that work?

I useing command : hisat2 -p 5 -x WholeGenomeFasta/Bisulfite_Genome/CT_conversion/BS_CT -1 CRR91_f1_test_40K.fastq.gz -2 CRR91_r2_test_40K.fastq.gz -S test_40k.sam When I use Hisat2 alone, I can run it and get the result. The result of the hisat2 comparison above is. But it still doesn't work using bismaek.

FelixKrueger commented 5 months ago

I meant does the single-end run work?

Yes ,I uesing sing-end reads to aligment ,the my command :
bismark --genome ../WholeGenomeFasta/ CRR793891_10Ktest_R1.fq.gz --path_to_hisat2 /public/home/software/opt/bio/software/hisat2/2.2.1/ --samtools_path /public/home/software/opt/bio/software/SAMtools/1.9/bin/ --bam --hisat2 --prefix CRR91_10k -o Bismark_CRR91_10k
wangguiqian commented 5 months ago

Using single-end comparison also gets the same result, which has no effect.

The result : Bismark report for: CRR793891_10Ktest_R1.fq.gz (version: v0.23.1) Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed) Bismark was run with HISAT2 against the bisulfite genome of /public/home/yqming/WGBS/WholeGenomeFasta/ with the specified options: -q --score-

Final Alignment report Sequences analysed in total: 10000 Number of alignments with a unique best hit from the different alignments: 0 Mapping efficiency: 0.0% Sequences with no alignments under any condition: 10000 Sequences did not map uniquely: 0 Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output: CT/CT: 0 ((converted) top strand) CT/GA: 0 ((converted) bottom strand) GA/CT: 0 (complementary to (converted) top strand) GA/GA: 0 (complementary to (converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report Total number of C's analysed: 0

Total methylated C's in CpG context: 0 Total methylated C's in CHG context: 0 Total methylated C's in CHH context: 0 Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 0 Total unmethylated C's in CHG context: 0 Total unmethylated C's in CHH context: 0 Total unmethylated C's in Unknown context: 0

Can't determine percentage of methylated Cs in CpG context if value was 0 Can't determine percentage of methylated Cs in CHG context if value was 0 Can't determine percentage of methylated Cs in CHH context if value was 0 Can't determine percentage of methylated Cs in Unknown context (CN or CHN) if value was 0

Bismark completed in 0d 0h 6m 50s

FelixKrueger commented 5 months ago

The pony indexing is still running, but roughly 1% aligns to the Lambda genome, and there are also at leas some reads on there that align to Arabidopsis thaliana, even up to 10% in single-end, local mode (Bowtie2), with bisulfite conversion:

Successfully deleted the temporary file CRR793891_10Ktest_R1_val_1.fq.gz_C_to_T.fastq

Final Alignment report
======================
Sequences analysed in total:    9988
Number of alignments with a unique best hit from the different alignments:      1044
Mapping efficiency:     10.5%

Sequences with no alignments under any condition:       8684
Sequences did not map uniquely: 260
Sequences which were discarded because genomic sequence could not be extracted: 0

Number of sequences with unique best (first) alignment came from the bowtie output:
CT/CT:  498     ((converted) top strand)
CT/GA:  546     ((converted) bottom strand)
GA/CT:  0       (complementary to (converted) top strand)
GA/GA:  0       (complementary to (converted) bottom strand)
...

C methylated in CpG context:    10.6%
C methylated in CHG context:    13.5%
C methylated in CHH context:    3.1%
C methylated in Unknown context (CN or CHN):    7.9%

Interstingly, the same command but with HISAT2 produced only 0.4% mapping (vs 10% with Bowtie2, minimap2 was at 7.6%), not quite sure what Bowtie2 is doing better here....

Preliminary conclusion

So for the time being I would suspect that the sample is either contaminated with something else, possibly a plant genome, or maybe the sample was accidentally swapped? The list of genomes I tested with FastQ Screen in --bisulfite mode did not reveal any major contaminant (e.g. human or PhiX), but alignments to the horse genome also yielded 0% alignments.

Are there any other organisms that are handled in your lab, or your institute, or were there other samples with unexpected results in the sequencing facility you used?

wangguiqian commented 5 months ago

I also find it very strange that I directly use Hisat2 to align the plant genome I am studying and the alignment is normal, but why do errors occur when using bismark?

FelixKrueger commented 5 months ago

what plant is that exactly? That would be a good candidate....

wangguiqian commented 5 months ago

the plant name is Tibetan wild tree peony genome (Paeonia ludlowii)

FelixKrueger commented 5 months ago

ah right, that's my bad, I had assumed it was a typo of 'pony' (which is why I tried the horse genome) :P

wangguiqian commented 5 months ago

ah right, that's my bad, I had assumed it was a typo of 'pony' (which is why I tried the horse genome) :P

No problem .Just i look you useing horse genome (pony) ,I feel errors .I useing Tibetan wild tree peony genome

FelixKrueger commented 5 months ago

This genome is really big, the indexing is still runnin.... I downloaded it from here: https://figshare.com/articles/dataset/Paeonia_ludlowii_genome_annotation/23537670. Is that the best option?

Interestingly, when using minimap2 for indexing it says:

minimap2 -k 20 -t 2 -d BS_GA.mmi genome_mfa.GA_conversion.fa

[WARNING] failed to parse the first FASTA/FASTQ record. Continue anyway.
[WARNING] failed to parse the first FASTA/FASTQ record. Continue anyway.
[WARNING] failed to parse the first FASTA/FASTQ record. Continue anyway.
[M::mm_idx_gen::91.113*0.12] collected minimizers
[M::mm_idx_gen::91.114*0.12] sorted minimizers
[M::main::91.120*0.12] loaded/built the index for 0 target sequence(s)
[M::mm_idx_stat] kmer size: 20; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::91.120*0.12] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan; total length: 0
[WARNING] failed to parse the first FASTA/FASTQ record. Continue anyway.

Maybe something is mis-formatted in the genome file?