Open pickettbd opened 3 years ago
Thanks Brandon - I'll check how those occurred in the index generation process.
For sequence.index files, we intended to have 5 columns to capture paired reads with their md5s and the sample name (column 5). For alignment.index files, we intended to have 4 columns for the bam and bam.bai files with their md5s.
None of the sequence.index examples you listed above were paired reads, so two empty fields were included there.
For some reason, extra spaces or tabs were introduced while updating those 4 alignment index files; these have now been fixed.
Okay- I think that makes sense. Let me just make sure I understand. You're saying that:
Is it also safe to assume the following?
Also- thanks for fixing those 4 alignment index files 😄 🙏
Yeah, your assumptions are correct!
Also, please let us know when you find anything unusual in the index files.
Personally I really appreciate your efforts in helping us to make this resource more valuable.
chunlin
Glad I can help 😄
If I come across any other things, I'll share my findings in an issue.
Another clarifying question for you: are the bionano alignment files supposed to have only 2 columns (XMAP_CMAP & XMAP_CMAP_MD5)?
Actually, the Bionano xmap/cmap index is an exception for alignment.index, as described in https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/README.ftp_structure:
The format of sequence.index (if no paired data, column 3 and 4 will be empty) as follow:
For fastqs: FASTQ FASTQ_MD5 PAIRED_FASTQ PAIRED_FASTQ_MD5 NIST_SAMPLE_NAME
For hdf5: HDF5 HDF5_MD5 NIST_SAMPLE_NAME
For SOLiD xsq: XSQ XSQ_MD5 NIST_SAMPLE_NAME
For BioNano bnx: BNX BNX_MD5 NIST_SAMPLE_NAME
The format of alignment.index:
For BAM: BAM BAM_MD5 BAI BAI_MD5
For BioNano XMAP or CMAP: XMAP_CMAP XMAP_CMAP_MD5
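The column counts described above can be checked mechanically. The following is a minimal sketch, not part of the GIAB tooling: `check_columns` is a hypothetical helper name, and it assumes the index files are tab-delimited, so that empty fields (such as columns 3 and 4 for unpaired fastqs) appear as adjacent tabs and still count as columns.

```shell
#!/bin/sh
# Hypothetical helper (not part of any GIAB tooling): report every line whose
# tab-separated column count differs from the expected one.
# Assumes tab-delimited files; empty fields between adjacent tabs are counted.
# usage: check_columns EXPECTED_COLUMNS FILE
check_columns() {
    expected="$1"
    file="$2"
    awk -F '\t' -v n="$expected" '
        NF != n {
            printf "%s:%d: %d columns (expected %d)\n", FILENAME, FNR, NF, n
            bad = 1
        }
        END { exit bad + 0 }   # non-zero exit status if any line was reported
    ' "$file"
}
```

For example, `check_columns 5 sequence.index` would be the check for a fastq sequence index and `check_columns 4 alignment.index` for a BAM alignment index (file names here are placeholders). A trailing tab shows up as one extra, empty column.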
Many thanks to you Brandon.
chunlin
I removed duplicated and trailing whitespace from the index files. In some cases, 2 or more tabs were present between columns; I collapsed these to a single tab. I also removed the whitespace at the end of lines. Otherwise, the text remains the same.
Issues were fixed with GNU sed like this:
sed -r -i 's,\t+,\t,g' file1 file2 ... fileN
sed -r -i 's,[\t ]+$,,' file1 file2 ... fileN
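Before running in-place edits like these, it may help to list which lines they would touch. The sketch below is a hypothetical dry-run helper (`find_whitespace_issues` is not an existing tool) that only reports lines and changes nothing. One thing worth reviewing in its output: if a sequence.index file represents the empty columns 3 and 4 of unpaired fastqs as adjacent tabs, those lines will also be reported as consecutive tabs, and collapsing them would change the column count.

```shell
#!/bin/sh
# Hypothetical dry-run helper: report lines that contain two or more
# consecutive tabs, or trailing tabs/spaces, without modifying any file.
# usage: find_whitespace_issues FILE...
find_whitespace_issues() {
    awk '
        /\t\t/    { printf "%s:%d: consecutive tabs\n", FILENAME, FNR }
        /[\t ]+$/ { printf "%s:%d: trailing whitespace\n", FILENAME, FNR }
    ' "$@"
}
```

Running `find_whitespace_issues file1 file2` prints one line per finding in `file:line: description` form, which can be reviewed before applying the sed commands.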
Here is the list of affected files: