FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

Number of bisulfite transformed reads are not equal between Reads 1 and 2 #657

Open itsjoeka opened 4 months ago

itsjoeka commented 4 months ago

Dear Bismark Author, I was running my data with the bismark tool and i got the error message below. the version of bismark is: version v0.24.2 and the code i used to run is: /home/jkalami/.conda/envs/bismark/bin/./bismark --non_directional --parallel 8 -f /media/biodataD/omics_project/bisulphite-dataset/ref -1 SRR10493735_1.fa -2 SRR10493735_2.fa report.txt

ERROR MESSAGE: Writing a C -> T converted version of the input file SRR10493735_2.fa.temp.2 to SRR10493735_2.fa.temp.2_C_to_T.fa Writing a G -> A converted version of the input file SRR10493735_2.fa.temp.2 to SRR10493735_2.fa.temp.2_G_to_A.fa

Created C -> T as well as G -> A converted versions of the FastA file SRR10493735_2.fa.temp.6 (49610851 sequences in total)

[FATAL ERROR]: Number of bisulfite transformed reads are not equal between Read 1 (#49602128) and Read 2 (#49610851). Possible causes: file truncation, or as a result of specifying read pairs that do not belong to each other?! Please re-specify file names! Exiting...

I hope to hear from you soon. Kind regards Jonathan.

FelixKrueger commented 4 months ago

Hi Jonathan,

I would say this is the first time in well over a decade that I see anyone using FastA files for alignment. I am wondering how you got hold of FastA files? The typical procedure would involve downloading the data in FastQ format:

Accession   Title   Platform    Total bases Create date SRA URL SRA filename    SRA nice filename   FastQ URL   FastQ Aspera URL    FastQ filename  FastQ nice filename
SRR10493735 GSM4176376: C1; Homo sapiens; Bisulfite-Seq Illumina HiSeq 2000 1190884 1574899200000   ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR104/SRR10493735/SRR10493735.sra SRR10493735.sra SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq.sra    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/035/SRR10493735/SRR10493735_1.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR104/035/SRR10493735/SRR10493735_1.fastq.gz    SRR10493735_1.fastq.gz  SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq_1.fastq.gz
SRR10493735 GSM4176376: C1; Homo sapiens; Bisulfite-Seq Illumina HiSeq 2000 1190884 1574899200000   ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR104/SRR10493735/SRR10493735.sra SRR10493735.sra SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq.sra    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR104/035/SRR10493735/SRR10493735_2.fastq.gz    era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR104/035/SRR10493735/SRR10493735_2.fastq.gz    SRR10493735_2.fastq.gz  SRR10493735_GSM4176376_C1_Homo_sapiens_Bisulfite-Seq_2.fastq.gz

You would then determine which type of library it is, and run the data through Trim Galore (FastQ required). Once that is done you would proceed to the mapping step.

And just to make sure that the FastA files you have both have the same number of reads could you run the following to count the number of lines?

for i in *SRR10493735*.fa; do echo $i; gunzip -c $i | wc -l; done

I should probably stress that I would recommend the FastQ steps described above. Also as a word of warning: --parallel 8 will probably consume north of 120GB of RAM for the alignment, you might want to tune this down a little....

itsjoeka commented 4 months ago

Hi Mr. Krueger, Thanks for such a quick reply. First, the workflow I am using for this project directed me to use the fasta files. I got hold of the fastA files by downloading the fastQ files from the SRA explorer like you described above, trimmed the adapters with trimgalore and converted them into the fastA format using the "awk" command. Thank you for the advice, I will try with the fastQ step and let you know how it goes. kinds regards jonathan.

itsjoeka commented 3 months ago

Hi Felix, I hope this message finds you well. I resorted to using the fastq files for the aligment process. I am using hg38.fa as the reference genome for the analysis. Before the alignment step, I prepared the reference genome using the command; bismark_genome_preparation --bowtie2 --verbose /bisulphite-dataset/ref/, which to successful completion. Afterwards I moves to the alignment step by running the command; bismark --non_directional -f /bisulphite-dataset/ref -1 SRR10493736_1_val_1.fq.gz -2 SRR10493736_2_val_2.fq.gz report.txt. Unforturnately, the process terminated and the screen read;

chr chrUn_KI270448v1 (7992 bp) chr chrUn_KI270521v1 (7642 bp) chr chrUn_GL000195v1 (182896 bp) chr chrUn_GL000219v1 (179198 bp) chr chrUn_GL000220v1 (161802 bp) chr chrUn_GL000224v1 (179693 bp) chr chrUn_KI270741v1 (157432 bp) chr chrUn_GL000226v1 (15008 bp) chr chrUn_GL000213v1 (164239 bp) chr chrUn_KI270743v1 (210658 bp) chr chrUn_KI270744v1 (168472 bp) chr chrUn_KI270745v1 (41891 bp) chr chrUn_KI270746v1 (66486 bp) chr chrUn_KI270747v1 (198735 bp) chr chrUn_KI270748v1 (93321 bp) chr chrUn_KI270749v1 (158759 bp) chr chrUn_KI270750v1 (148850 bp) chr chrUn_KI270751v1 (150742 bp) chr chrUn_KI270752v1 (27745 bp) chr chrUn_KI270753v1 (62944 bp) chr chrUn_KI270754v1 (40191 bp) chr chrUn_KI270755v1 (36723 bp) chr chrUn_KI270756v1 (79590 bp) chr chrUn_KI270757v1 (71251 bp) chr chrUn_GL000214v1 (137718 bp) chr chrUn_KI270742v1 (186739 bp) chr chrUn_GL000216v2 (176608 bp) chr chrUn_GL000218v1 (161147 bp) chr chrX (156040895 bp) chr chrY (57227415 bp) chr chrY_KI270740v1_random (37240 bp)

Single-core mode: setting pid to 1

Paired-end alignments will be performed

The provided filenames for paired-end alignments are /hicanu/canu-scripts/trimmed/SRR10493736_1_val_1.fq.gz and /hicanu/canu-scripts/trimmed/SRR10493736_2_val_2.fq.gz Input files are in FastA format Writing a C -> T converted version of the input file SRR10493736_1_val_1.fq.gz to SRR10493736_1_val_1.fq.gz_C_to_T.fa Writing a G -> A converted version of the input file SRR10493736_1_val_1.fq.gz to SRR10493736_1_val_1.fq.gz_G_to_A.fa Input file doesn't seem to be in FastA format at sequence 1: Inappropriate ioctl for device.

The error says my input file is not in the fastA format. Kindly help me know where i went wrong and what i can do to resolve the issue. Kind regards Jonathan.

FelixKrueger commented 3 months ago

Your command contains the flag -f which instructs Bismark to expect FastA files. Just drop it, and all will be fine.

itsjoeka commented 3 months ago

Hello Mr. Felix, Thanks for the wonderful feedbacks, they have helped me start the alignment process which is currently running. I am using the hg38.fa as the reference genome for the alignment. A little concern though, on the screen where I am running the alignment, I see these texts which I have pasted below, and it has been like this since after the c -> T and G -> A conversion step. Should I be concerned even though the process is still running and files being created. the command line I issued was;

bismark --non_directional /home/jupyter/epilepsyvar/dataset/ref_genome/ -1 SRR10493735_1_val_1.fq.gz -2 SRR10493735_2_val_2.fq.gz report>

Chromosomal sequence could not be extracted for SRR10493735.33103437_33103437_length=150 chrM 1 Chromosomal sequence could not be extracted for SRR10493735.33743215_33743215_length=150 chrM 2 Chromosomal sequence could not be extracted for SRR10493735.33744222_33744222_length=150 chrM 2 Processed 34000000 sequence pairs so far Processed 35000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.35930829_35930829_length=150 chrM 2 Processed 36000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.36648650_36648650_length=150 chr1_KI270709v1_random 1 Chromosomal sequence could not be extracted for SRR10493735.36653440_36653440_length=150 chr1_KI270709v1_random 1 Processed 37000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.37699908_37699908_length=150 chrM 1 Processed 38000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.38295370_38295370_length=150 chrM 2 Chromosomal sequence could not be extracted for SRR10493735.38296453_38296453_length=150 chrM 2 Chromosomal sequence could not be extracted for SRR10493735.38411063_38411063_length=150 chrM 1 Processed 39000000 sequence pairs so far Processed 40000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.40291152_40291152_length=150 chrM 1 Chromosomal sequence could not be extracted for SRR10493735.40847076_40847076_length=150 chrUn_KI270590v1 4535 Processed 41000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.41399548_41399548_length=150 chrUn_KI270529v1 2 Chromosomal sequence could not be extracted for SRR10493735.41534559_41534559_length=150 chr1_KI270710v1_random 40027 Processed 42000000 sequence pairs so far Processed 43000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.43082757_43082757_length=150 chrM 16425 Chromosomal sequence could not be extracted for SRR10493735.43751028_43751028_length=150 chrM 2 Chromosomal sequence could not be extracted for SRR10493735.43762911_43762911_length=150 chrM 2 Processed 44000000 sequence pairs so far Processed 45000000 sequence pairs so far Processed 46000000 sequence pairs so far Processed 47000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.47831134_47831134_length=150 chrUn_KI270529v1 1 Processed 48000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.48678792_48678792_length=150 chrM 1 Chromosomal sequence could not be extracted for SRR10493735.48679160_48679160_length=150 chrM 1 Chromosomal sequence could not be extracted for SRR10493735.48975409_48975409_length=150 chrM 2 Processed 49000000 sequence pairs so far Processed 50000000 sequence pairs so far Chromosomal sequence could not be extracted for SRR10493735.50032426_50032426_length=150 chrM 1 Chromosomal sequence could not be extracted for SRR10493735.50075308_50075308_length=150 chrM 2 Processed 51000000 sequence pairs so far

FelixKrueger commented 3 months ago

That's all fine, you just have a few reads that align to the very end of the mitochondrium. Just keep going!