genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.

paired reads have different names #18

Open Mahmoudbassuoni opened 1 year ago

Mahmoudbassuoni commented 1 year ago

Hi, I am trying to run an alignment with bwa mem on the two files "U0a_CGATGT_L001_R1_001.fastq.gz" and "U0a_CGATGT_L001_R2_001.fastq.gz", which I downloaded from the FTP site, against the reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz". The command I am using is:

bwa mem -t 16 -R '@RG\tID:H814YADXX.5.CGATGT.1101\tSM:HG001\tPL:illumina' GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz U0a_CGATGT_L001_R1_001.fastq.gz U0a_CGATGT_L001_R2_001.fastq.gz | samtools view -b - >HG001.GRCh38_no_alt_analysis_set.bam

but I am getting an error with the sequence headers:

[mem_sam_pe] paired reads have different names: "HWACAGATTTTGT", "HWI-D00360:5:H814YADXX:1:1102:11719:83283"
[mem_sam_pe] paired reads have different names: "HWACTATTDDD", "HWI-D00360:5:H814YADXX:1:1102:11293:83492"
[mem_sam_pe] paired reads have different names: "@@FAAA(+:A0&AA", "HWI-D00360:5:H814YADXX:1:1102:11730:83321"
[mem_sam_pe] paired reads have different names: "HWI-A@HWI-D00360:5:H814YX:1:1102:10399:83348", "HWI-00360:5:H814YADXX:1:1102:11699:83300"
[mem_sam_pe] paired reads have different names: "ACD00360TJJJJC@AGCCCTGCACCACCTAATAAGAACTGGAAAGTCEEDDDDDDDD", "HWI-D00360:5:H814YADXX:1:1102:11719:83361"
[mem_sam_pe] paired reads have different names: "HWCTAAAATC:BDDDDFDDDDDDCEDDDHJJEHIIIJJJHHH>HFFEEEEET:83ACDDDDTAAATTEDDDDDDEDDDDJJFHJJJJJJJJJJJJJJJJJJJJJJIJJJJJJ@T4BJJJTTATCTTG>FGGCAGGCTJJIJJJJJEDEECDDFAAGTAAADDDDDDDCTCTTCTTGTTTTCCCC>AGCC60:5:HC814YJDDDCCDDIGCCCTTC1IIIIHIEDDD@FFFCTTC1IIIIHIEDCCC;>CC60:5:H:0:CGADXX:1:1ATGTTTA:N:0:CGAC>CGAC>CG3AGGCTGAGGYADXX:JJJJJJJJIJJA0360GAIAGEEDEEEEC:GJIIJJJC:0:CGATGIFFFHHHHHJJJJJDEDDDDDGDEDDDDGTTTTTAT@HHJJJTGT", "HWI-D00360:5:H814YADXX:1:1102:11549:83491"
[mem_sam_pe] paired reads have different names: "HWCATCCTCCCAAGACTAADD@FFFC99:833C99:833CGCTTTGFHH@FFFFDDDCCCDCFB:>CA8>A??JJHHHF60:5:HHH@FFFGTGGCAGGCTCCTGTAACGDDDDDDDDATGAACTCIACTAGDDDBBDDG9ATGGAATTTGACTTGADXX:1CACCTGCCAAACATACCCGTCTTTACC(G36CAGACCACCTGGACTTCCAGGEECDCDCDGAGGCCTGGCCATGTTATATGAAGTGIDXX:1CACCTGCCAAACATACCCGT", "HWI-D00360:5:H814YADXX:1:1102:11746:83407"
[mem_sam_pe] paired reads have different names: "HWACTATTDEFFFHHHHCCTTGTGTE:@DDDD49?IJJIGIG83407", "HWI-D00360:5:H814YADXX:1:1102:11545:83354"

I have tried sorting the two files with fastq-sort, but I still get the same error. Can anyone help?
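One way to narrow down an error like this is to walk both files in parallel and report the first record where the read names disagree or where a header line does not start with "@". The following is only a minimal sketch, not part of bwa or the GIAB tooling: the function names `read_names` and `first_mismatch` are hypothetical, and it assumes plain 4-line FASTQ records in gzipped files.

```python
import gzip
from itertools import zip_longest


def read_names(path):
    """Yield the read name of each record, assuming 4-line FASTQ records."""
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 != 0:
                continue  # only the first line of each record is a header
            fields = line.split()
            header = fields[0] if fields else ""
            if not header.startswith("@"):
                raise ValueError(
                    f"{path}: record {index // 4 + 1} has no '@' header: {line!r}"
                )
            name = header[1:]
            # Drop an optional /1 or /2 mate suffix before comparing.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(r1_path, r2_path):
    """Return (record_number, r1_name, r2_name) for the first disagreement, or None."""
    pairs = zip_longest(read_names(r1_path), read_names(r2_path))
    for record_number, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            return record_number, name1, name2
    return None
```

If one file is shorter than the other, the missing name is reported as `None`. A damaged or truncated download may also surface here as an exception raised by the `gzip` module while reading, which would point to the file itself and not to the pairing.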

chunlinxiao commented 1 year ago

You need to use the sequence.index file (https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/sequence.index.NA12878_Illumina300X_wgs_09252015 in your case) to match R1 and R2 files.

For the 300X ILMN raw reads, some R1/R2 files have the same names but are located in different directories, e.g.,

ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cabfe5b609fb1fe11619fdc72060185c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz 6f0faed9249c1a850e6ce57c61e26e04 HG001

ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cc35b61053fe7505715f93175bbb16c4 ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz cd12a23c3d71061e1bc673ce8c598dba HG001

Hope this helps.
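To illustrate how an index row pairs the two files, here is a small sketch that splits one row into its parts. It assumes the five whitespace-separated columns visible in the rows quoted above (R1 URL, R1 md5, R2 URL, R2 md5, sample name); the names `IndexEntry`, `parse_index_line` and `find_pair` are hypothetical and not part of any GIAB tool.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    r1_url: str
    r1_md5: str
    r2_url: str
    r2_md5: str
    sample: str


def parse_index_line(line):
    """Split one index row into its five fields (assumed layout)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    return IndexEntry(*fields)


def find_pair(entries, r1_url):
    """Return the entry whose R1 URL matches exactly, or None."""
    for entry in entries:
        if entry.r1_url == r1_url:
            return entry
    return None
```

Matching on the full URL and not on the file name alone is the point here, since the same file name occurs under more than one flowcell directory.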

Mahmoudbassuoni commented 1 year ago

Yes, I used the forward and reverse reads for the same run from the same folder, which should be on the same line in the index you linked. In other words, I took both FTP links from a single line, so they should belong to the same run.

chunlinxiao commented 1 year ago

In your example, can you post the full paths of the two files you were using for mapping? Have you checked the md5?
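For reference, the md5 of a downloaded file can be computed along these lines and compared against the value listed next to it in the index. This is a sketch using only the Python standard library; `file_md5` is a hypothetical helper name, and the result should match what the `md5sum` command prints for the same file.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A mismatch here only says that the compressed bytes differ from what was indexed; on its own it does not show whether the reads inside are damaged.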

Mahmoudbassuoni commented 1 year ago

Hi, this was the forward read file: "ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz" and this was the reverse one: "ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz"

Mahmoudbassuoni commented 1 year ago

I have checked the md5 now, and it looks like something went wrong with the download. I am downloading the files again and will check once more, then get back to you. Thanks,

Mahmoudbassuoni commented 1 year ago

Hi @chunlinxiao, I have downloaded the files again, but the md5sum output still does not match the values on the FTP site. I am not sure what could be wrong. I tried the same thing with another pair of files and the same thing happens.

Mahmoudbassuoni commented 1 year ago

I have tried running the alignment with a pair of read files from the folder "giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/" and it went fine. However, I am not sure about the data quality, because those files are from 2014, whereas the files above are from 2020 and so should be more reliable.

chunlinxiao commented 1 year ago

Thanks for the update, and glad your alignment is working now. I also tested your pairs on our side and nothing was wrong, so the paired data is fine.

Regarding the md5: we recently performed a metadata collection/analysis on all FASTQs that involved gunzip/gzip. This can produce different md5s, because the gz file header differs if gzip -n is not used. However, the uncompressed FASTQ files are unchanged and have identical md5s. The sequence.index files may need to be updated accordingly.
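The effect can be reproduced on a toy record. The sketch below compresses the same bytes twice with different timestamps in the gzip header, which is the field that gzip -n leaves unset, and then compares checksums of the compressed and decompressed data. The helper names `gzip_bytes` and `content_md5` are made up for illustration, and the record is not real GIAB data.

```python
import gzip
import hashlib
import io


def gzip_bytes(data, mtime):
    """Compress data in memory, storing the given timestamp in the gzip header."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime) as handle:
        handle.write(data)
    return buffer.getvalue()


def content_md5(gz_data):
    """Return the hex md5 of the decompressed payload."""
    return hashlib.md5(gzip.decompress(gz_data)).hexdigest()


fastq = b"@read1\nACGT\n+\nIIII\n"  # toy record, for illustration only
first = gzip_bytes(fastq, mtime=1)
second = gzip_bytes(fastq, mtime=2)

# The compressed files differ only in the header timestamp, so their md5s differ,
# while the md5 of the decompressed content is the same for both.
print(hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest())
print(content_md5(first) == content_md5(second))
```

For multi-gigabyte FASTQ files the decompressed data would need to be hashed in chunks from a stream and not loaded into memory as done here.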

Mahmoudbassuoni commented 1 year ago

So what do you think about relying on the old FASTQs from 2014? I am running a benchmarking process, so is it fine to use those FASTQs together with the VCFs from the NIST V4 directory?

jzook commented 1 year ago

Hi @Mahmoudbassuoni - all of the files in those directories were generated ~2014. They are probably ok to use for some purposes, but if you want to understand how your methods work on more recent Illumina data, you may want to use data from this publication: https://doi.org/10.1101/2020.12.11.422022.

chunlinxiao commented 1 year ago

Hi @Mahmoudbassuoni, the md5s were updated in sequence.index.NA12878_Illumina300X_wgs_09252015_updated (you can follow the link from the table).