Open Mahmoudbassuoni opened 1 year ago
You need to use the sequence.index file (https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/sequence.index.NA12878_Illumina300X_wgs_09252015 in your case) to match R1 and R2 files.
For the 300X Illumina raw reads, some R1/R2 files have the same names but are located in different directories, e.g.,
ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cabfe5b609fb1fe11619fdc72060185c ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz 6f0faed9249c1a850e6ce57c61e26e04 HG001
ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz cc35b61053fe7505715f93175bbb16c4 ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_006_AH81VLADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz cd12a23c3d71061e1bc673ce8c598dba HG001
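If it is useful, here is a minimal Python sketch of how the index could be used to pair files. It assumes each data row has five whitespace-separated fields in the order shown in the two lines above (R1 URL, R1 md5, R2 URL, R2 md5, sample name) and that the run directory sits four path components from the end of the URL. The names `FastqPair`, `parse_index_lines` and `local_name` are made up for illustration and are not part of any GIAB tooling.

```python
from typing import Iterable, List, NamedTuple


class FastqPair(NamedTuple):
    r1_url: str
    r1_md5: str
    r2_url: str
    r2_md5: str
    sample: str


def parse_index_lines(lines: Iterable[str]) -> List[FastqPair]:
    """Collect R1/R2 pairs from index rows (assumed five-field layout)."""
    pairs = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and any header row: keep only rows that have
        # five fields and start with something that looks like a URL.
        if len(fields) != 5 or "://" not in fields[0]:
            continue
        pairs.append(FastqPair(*fields))
    return pairs


def local_name(url: str) -> str:
    """Build a local file name that keeps the run directory.

    Files from different runs can share a base name, so the run directory
    (assumed to be the fourth path component from the end) is prefixed.
    """
    parts = url.split("/")
    return f"{parts[-4]}__{parts[-1]}"
```

Saving each download under `local_name(url)` rather than the bare file name avoids silently overwriting an R1 from one run with the same-named R1 from another.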
Hope this helps.
Yes, I used the forward and reverse reads from the same run and the same folder, which should be on the same line of the index file you linked. In other words, I took both FTP links from a single line, so they should belong to the same run.
In your example, can you post the full paths of the two files you were using for mapping? Have you checked the md5?
Hi, this was the forward (R1) file: "ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz" and this was the reverse (R2) file: "ftp://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R2_001.fastq.gz"
I have checked the md5 now, and it looks like something went wrong with the download. I am downloading the files again, will check once more, and get back to you. Thanks,
Hi @chunlinxiao, I have downloaded the files again, but the md5sum output still does not match the values on the FTP site. I am not sure what could be wrong; I tried the same thing with another pair of files and the same thing happens.
I have tried running the alignment with two paired read files from the folder "giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/"
and it went fine, but I am not sure about the data quality: those files are from 2014, whereas the other files above are from 2020, so they should be more reliable.
Thanks for the update, and glad your alignment works now. I also tested your pairs on our side and nothing was wrong, so the paired data is fine.
Regarding the md5: we recently performed a metadata collection/analysis on all FASTQs that involved gunzip/gzip. This can produce different md5s for the .gz files (the gzip header differs if gzip -n is not used). However, the uncompressed FASTQ files are unchanged and have identical md5s. The sequence.index files may need to be updated accordingly.
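To check this on your side, one option is to compare the md5 of the decompressed content rather than of the .gz file itself. The following is a small Python sketch using only the standard library; the function names are illustrative, and it assumes you have a known md5 for the uncompressed FASTQ to compare against.

```python
import gzip
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """md5 of the file exactly as stored on disk (gzip header included)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_gzip_payload(path, chunk_size=1 << 20):
    """md5 of the decompressed content, independent of the gzip header."""
    digest = hashlib.md5()
    with gzip.open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two .gz files holding the same FASTQ but compressed at different times can give different `md5_of_file` values while `md5_of_gzip_payload` stays the same, because the gzip header stores a timestamp and the original file name.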
So what do you think about relying on the old FASTQs from 2014? I am running a benchmark, so is it fine to use those FASTQs together with the VCFs from the NIST V4 directory?
Hi @Mahmoudbassuoni - all of the files in those directories were generated ~2014. They are probably OK to use for some purposes, but if you want to understand how your methods work on more recent Illumina data, you may want to use data from this publication: https://doi.org/10.1101/2020.12.11.422022.
Hi @Mahmoudbassuoni, the md5s were updated in sequence.index.NA12878_Illumina300X_wgs_09252015_updated (you can follow the link from the table).
Hi, I am trying to run an alignment with bwa mem on the two files "U0a_CGATGT_L001_R1_001.fastq.gz" and "U0a_CGATGT_L001_R2_001.fastq.gz", which I got from the FTP site, against the reference "GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz". The command I am using is:
bwa mem -t 16 -R '@RG\tID:H814YADXX.5.CGATGT.1101\tSM:HG001\tPL:illumina' GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz U0a_CGATGT_L001_R1_001.fastq.gz U0a_CGATGT_L001_R2_001.fastq.gz | samtools view -b - >HG001.GRCh38_no_alt_analysis_set.bam
but I am getting an error with the sequence headers:
I have tried sorting the two files with fastq-sort but I still get the same error. Can anyone help?
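The exact error text is not shown here, but if the complaint is that paired reads have different names, a quick check is to walk both files in parallel and report the first record where the read IDs disagree. Below is a rough Python sketch under a few assumptions: both files are gzip-compressed, every record is exactly four lines, and the read ID is the first whitespace-delimited token of the header line, with an optional trailing /1 or /2. The helper names are made up for this example.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID from each record's header line (mate suffix removed)."""
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 != 0:
                continue  # only the first line of each 4-line record is a header
            fields = line[1:].split()  # drop the leading '@'
            name = fields[0] if fields else ""
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(r1_path, r2_path):
    """Return (record_index, r1_id, r2_id) at the first disagreement, else None.

    A missing ID (None) means one file ran out of records before the other.
    """
    pairs = zip_longest(read_ids(r1_path), read_ids(r2_path))
    for index, (id1, id2) in enumerate(pairs):
        if id1 != id2:
            return index, id1, id2
    return None
```

A result of `None` means the two files are in step. A mismatch at a very early index usually points at files from different runs being paired, while a mismatch near the end with a `None` ID suggests a truncated download.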