genome-in-a-bottle / giab_data_indexes

This repository contains data indexes from NIST's Genome in a Bottle project.

removed duplicated and trailing whitespace #13

Open pickettbd opened 3 years ago

pickettbd commented 3 years ago

I removed the duplicated and trailing whitespace from index files. In some cases, 2 or more tabs were present between columns. I also removed the trailing whitespace at the end of the lines. Otherwise, the text remains the same.

Issues were fixed with GNU sed like this:

    sed -r -i 's,\t+,\t,g' file1 file2 ... fileN
    sed -r -i 's,[\t ]+$,,' file1 file2 ... fileN

Here is the list of affected files:

AshkenazimTrio/sequence.index.AJtrio_HG002_NIST_SOLiD5500W_xsq_09042015.HG002
AshkenazimTrio/alignment.index.AJtrio_Illumina_6kb_matepair_wgs_bwamem_GRCh37_07302015.HG002
AshkenazimTrio/alignment.index.AJtrio_Illumina_6kb_matepair_wgs_bwamem_GRCh37_07302015
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015
AshkenazimTrio/sequence.index.AJtrio_HG002_Cornell_Oxford_Nanopore_fasta_fastq_10132015
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015.HG003
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015.HG004
AshkenazimTrio/sequence.index.AJtrio_HG002_NIST_SOLiD5500W_xsq_09042015
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015.HG002
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015.HG002
AshkenazimTrio/sequence.index.AJtrio_HG002_Cornell_Oxford_Nanopore_fasta_fastq_10132015.HG002
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015.HG003
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015.HG004
AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_isaac-align_hg19_06012016.HG004
AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_isaac-align_hg19_06012016
ChineseTrio/sequence.index.ChineseTrio_HG005_BioNano_bnx_10012015.HG005
ChineseTrio/sequence.index.ChineseTrio_HG005_NIST_SOLiD5500W_xsq_09042015
ChineseTrio/sequence.index.ChineseTrio_HG005_BioNano_bnx_10012015
ChineseTrio/sequence.index.ChineseTrio_HG005_NIST_SOLiD5500W_xsq_09042015.HG005
NA12878/sequence.index.NA12878_PacBio_MtSinai_NIST_hdf5_08182015

chunlinxiao commented 3 years ago

Thanks, Brandon. I'll check how those occurred in the index generation process.

chunlinxiao commented 3 years ago

For sequence.index files, we intended to have 5 columns to capture paired reads with their md5s and the sample name (column 5). For alignment.index files, we intended to have 4 columns for the bam and bam.bai files with their md5s.

None of the sequence.index examples you listed above were paired reads, so two empty fields were included there.

For some reason, extra spaces or tabs were introduced while those 4 alignment index files were being updated, but these have now been fixed.
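
To illustrate why the runs of tabs in the non-paired sequence.index rows carry meaning, here is a minimal Python sketch. The row contents (file name, checksum, sample) are hypothetical placeholders, and the two regular expressions simply mirror the sed substitutions quoted earlier in this thread. It is an illustration under those assumptions, not the project's tooling.

```python
import re

# Hypothetical non-paired row: columns 3 and 4 (paired file and its md5) are empty.
row = "reads.bnx\td41d8cd98f00b204e9800998ecf8427e\t\t\tHG002"


def collapse_tabs(line: str) -> str:
    # Same effect as the first sed command: squeeze each run of tabs into one tab.
    return re.sub(r"\t+", "\t", line)


def strip_trailing(line: str) -> str:
    # Same effect as the second sed command: drop spaces and tabs at end of line.
    return re.sub(r"[\t ]+$", "", line)


print(len(row.split("\t")))                  # 5 columns, two of them empty
print(len(collapse_tabs(row).split("\t")))   # 3 columns: the empty fields are gone
print(len(strip_trailing(row).split("\t")))  # 5 columns: nothing trailing to strip
```

Under these assumptions, only the trailing-whitespace substitution is safe to apply to rows with intentionally empty columns; collapsing tab runs changes the column count.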

pickettbd commented 3 years ago

Okay, I think that makes sense. Let me just make sure I understand. You're saying that:

  1. The sequence index files should have 5 columns regardless of whether there are paired reads or single reads
  2. Alignment index files should have 4 columns.

Is it also safe to assume the following?

  1. Columns are tab-delimited for both sequence and alignment index files
  2. No spaces or tabs should trail the end of a line
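
The four points above could be checked mechanically. Below is a minimal Python sketch, assuming exactly the rules listed (5 tab-delimited columns for sequence index files, 4 for alignment index files, no trailing spaces or tabs). The function name `check_index_lines` and the sample rows are hypothetical, and any format-specific exceptions to these column counts are not handled. Whether the real files carry a header row is not stated here; a header with the expected number of tab-separated names would pass.

```python
import io
from typing import Iterable, List

# Expected column counts, taken from the assumptions listed above.
EXPECTED_COLUMNS = {"sequence": 5, "alignment": 4}


def check_index_lines(lines: Iterable[str], kind: str) -> List[str]:
    """Return human-readable problems found in the lines of one index file."""
    expected = EXPECTED_COLUMNS[kind]
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore completely blank lines
        if line != line.rstrip(" \t"):
            problems.append(f"line {number}: trailing whitespace")
        fields = line.split("\t")
        if len(fields) != expected:
            problems.append(
                f"line {number}: {len(fields)} columns, expected {expected}"
            )
    return problems


# Hypothetical alignment index content: the second row has a doubled tab
# and a trailing space.
sample = io.StringIO(
    "a.bam\tmd5a\ta.bam.bai\tmd5b\n"
    "b.bam\t\tmd5c\tb.bam.bai\tmd5d \n"
)
for problem in check_index_lines(sample, "alignment"):
    print(problem)
```

Empty interior fields count as columns here, so a non-paired sequence row with two empty fields still passes the 5-column check.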

Also- thanks for fixing those 4 alignment index files 😄 🙏

chunlinxiao commented 3 years ago

Yeah, your assumptions are correct!

Also, please let us know if you find anything unusual in the index files.

Personally I really appreciate your efforts in helping us to make this resource more valuable.

chunlin

pickettbd commented 3 years ago

Glad I can help 😄

If I come across any other things, I'll share my findings in an issue.

Another clarifying question for you: are the bionano alignment files supposed to have only 2 columns (XMAP_CMAP & XMAP_CMAP_MD5)?

chunlinxiao commented 3 years ago

Actually, the Bionano xmap/cmap index is an exception for alignment.index, as described in https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/README.ftp_structure:

The format of sequence.index (if there is no paired data, columns 3 and 4 will be empty) is as follows:

  For fastqs: FASTQ FASTQ_MD5 PAIRED_FASTQ PAIRED_FASTQ_MD5 NIST_SAMPLE_NAME
  For hdf5: HDF5 HDF5_MD5 NIST_SAMPLE_NAME
  For SOLiD xsq: XSQ XSQ_MD5 NIST_SAMPLE_NAME
  For BioNano bnx: BNX BNX_MD5 NIST_SAMPLE_NAME

The format of alignment.index is as follows:

  For BAM: BAM BAM_MD5 BAI BAI_MD5
  For BioNano XMAP or CMAP: XMAP_CMAP XMAP_CMAP_MD5
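
The layouts quoted above can be turned into a small parser. The following Python sketch assumes that columns are tab-delimited (as confirmed earlier in this thread) and that every sequence.index row carries five columns, with columns 3 and 4 left empty for non-paired data. The generic field names (`file`, `md5`, and so on), the function `parse_row`, and the example rows are placeholders chosen for this illustration, not names from the README.

```python
from typing import Dict, Optional, Tuple

# Column labels for each layout; the names are illustrative placeholders.
SEQUENCE_FIELDS = ("file", "md5", "paired_file", "paired_md5", "sample")
BAM_FIELDS = ("bam", "bam_md5", "bai", "bai_md5")
XMAP_CMAP_FIELDS = ("xmap_cmap", "xmap_cmap_md5")


def parse_row(line: str, fields: Tuple[str, ...]) -> Dict[str, Optional[str]]:
    """Split one tab-delimited index row and label its columns.

    Empty columns (such as the paired-file columns of a non-paired row)
    are returned as None.
    """
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} columns, got {len(values)}")
    return {name: (value or None) for name, value in zip(fields, values)}


# Hypothetical non-paired hdf5 row: file, md5, two empty columns, sample name.
record = parse_row("run1.bas.h5\t0123abcd\t\t\tHG002\n", SEQUENCE_FIELDS)
print(record["paired_file"], record["sample"])

# Hypothetical BioNano alignment row, which has only two columns.
print(parse_row("HG002.xmap\t89abcdef", XMAP_CMAP_FIELDS))
```

Raising on a column-count mismatch keeps rows whose empty fields were collapsed from being silently mislabeled.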

Many thanks to you Brandon.

chunlin