caporaso-lab / mockrobiota

A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.
http://mockrobiota.caporasolab.us
BSD 3-Clause "New" or "Revised" License

Question about barcode length in mock2 and mock6 #77

Open MathieuCharles opened 7 years ago

MathieuCharles commented 7 years ago

Hello,

Thanks for explanations in #76.

One more question about mock2 and mock6.

The barcodes listed for mock2 and mock6 are of length 12 and 6, respectively, but in their mock-index-reads.fastq files the index reads are of length 13 and 7, respectively.

Example for mock2: the barcode in sample_metadata.tsv is ATCTGCCTGGAA. If I search for perfect matches in the index fastq file, I find:

     23 AATCTGCCTGGAA
 243167 ATCTGCCTGGAAA
    446 ATCTGCCTGGAAC
     17 ATCTGCCTGGAAG
      1 ATCTGCCTGGAAN
    681 ATCTGCCTGGAAT
     62 TATCTGCCTGGAA

Is the correct barcode ATCTGCCTGGAAA (with an A at the end)? What is your advice?

The same problem occurs with mock6:

         ACCTGT           ACCTCG           ACCGCA
    951 AACCTGT      195 AACCTCG       58 AACCGCA
   2212 ACCTGTA   210433 ACCTCGA     1245 ACCGCAA
 277218 ACCTGTC    36791 ACCTCGC     5589 ACCGCAC
   4911 ACCTGTG     1878 ACCTCGG   312775 ACCGCAG
   1399 ACCTGTT     5707 ACCTCGT     2041 ACCGCAT
     24 CACCTGT       16 CACCTCG        1 GACCGCA
      1 GACCTGT       10 GACCTCG       90 TACCGCA
     46 TACCTGT        1 NACCTCG
                    1092 TACCTCG

Many thanks for these datasets!

nbokulich commented 7 years ago

@MathieuCharles thanks for finding this issue! I had not noticed this previously because, evidently, it does not affect the ability of QIIME to demultiplex these data.

I am still trying to figure out why the barcodes in the index read files are 1 nt longer than those in the sample metadata (all data are provided by contributors, so this may take time to track down). For now, it seems reasonable to assume that the most common match is the correct one; in most cases it appears to be more than 100-fold more common than the other matches.
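
The "most common match" heuristic can be sketched in a few lines of Python. This is only an illustration, not part of mockrobiota or QIIME: the function names `tally_index_reads` and `dominant_match` are hypothetical, and the parser assumes a plain, unwrapped 4-line-per-record FASTQ file. The example counts are the mock2 numbers reported earlier in this thread.

```python
from collections import Counter


def tally_index_reads(fastq_handle, barcode):
    """Count each distinct index read that contains `barcode` as a substring.

    Assumes unwrapped FASTQ: the sequence is the 2nd line of every 4-line record.
    """
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # sequence line of the record
            seq = line.strip()
            if barcode in seq:
                counts[seq] += 1
    return counts


def dominant_match(counts):
    """Return (sequence, count, fold_over_runner_up), or None if there are no matches.

    fold_over_runner_up is None when only one distinct match exists.
    """
    ranked = counts.most_common(2)
    if not ranked:
        return None
    best_seq, best_n = ranked[0]
    fold = best_n / ranked[1][1] if len(ranked) > 1 else None
    return best_seq, best_n, fold


# Counts reported in this thread for the mock2 barcode ATCTGCCTGGAA.
mock2_counts = Counter({
    "AATCTGCCTGGAA": 23,
    "ATCTGCCTGGAAA": 243167,
    "ATCTGCCTGGAAC": 446,
    "ATCTGCCTGGAAG": 17,
    "ATCTGCCTGGAAN": 1,
    "ATCTGCCTGGAAT": 681,
    "TATCTGCCTGGAA": 62,
})

print(dominant_match(mock2_counts))
```

For mock2 this picks ATCTGCCTGGAAA, which is roughly 357-fold more common than the runner-up. To run it against a real file, something like `tally_index_reads(open("mock-index-reads.fastq"), "ATCTGCCTGGAA")` should work, assuming the file is uncompressed. Checking the fold value per barcode is worthwhile, since the margin is not always large (for ACCTCG in mock6 the counts above give only about a 6-fold difference).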