biocore / greengenes2

Processing support for Greengenes2

Sequence mapping information #2

Open ygouin opened 1 year ago

ygouin commented 1 year ago

Is there a way to get mapping/coordinate information about where the sequences are from?

wasade commented 1 year ago

Thanks, @ygouin! The exact answer depends on the type of record. For the current release, we source data from:

All of those full length 16S are in the publicly available backbone artifact.

We additionally place 16S V4 fragments and some full length records from SILVA separately. The 16S fragments come from Qiita, as obtained by redbiom. Those IDs are represented in one of three ways: by ASV (e.g., TACAGAACCCCCGAGCGTTACCCGGATTTATTGGGCGTAAAGGGTCTGTAGGTGGTCACGTAAGTTTCAAGTTAAAGCTCTTCGGCTTAA), by MD5 (e.g., 48700b14c9d20c30b0f575d3e01e0e5c), or by an arbitrary 8-digit identifier (e.g., 22383138). The SILVA records retain their original ID (e.g., JQIO01000430.1097739.1099039).
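To make the ID forms above concrete, here is a minimal Python sketch that guesses which kind of identifier a record carries. The function names (`classify_id`, `parse_silva_id`) and the regular expressions are illustrative assumptions based only on the examples in this comment; they are not part of Greengenes2 or redbiom. The sketch also assumes that SILVA IDs follow the `accession.start.stop` layout, which would mean the ID itself carries coordinates on the source accession.

```python
import re

# Patterns inferred from the example IDs in this thread (assumptions, not a spec).
_MD5 = re.compile(r"^[0-9a-f]{32}$")          # 32 lowercase hex characters
_NUMERIC = re.compile(r"^\d{8}$")             # arbitrary 8-digit identifier
_ASV = re.compile(r"^[ACGTN]+$")              # the sequence itself is the ID
_SILVA = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)\.(\d+)$")  # accession.start.stop


def classify_id(record_id):
    """Return a best-guess label for the kind of ID: md5, numeric, asv, silva, or unknown."""
    if _MD5.match(record_id):
        return "md5"
    if _NUMERIC.match(record_id):
        return "numeric"
    if _ASV.match(record_id):
        return "asv"
    if _SILVA.match(record_id):
        return "silva"
    return "unknown"


def parse_silva_id(record_id):
    """Split a SILVA-style ID into (accession, start, stop).

    Assumes the accession.start.stop layout; raises ValueError otherwise.
    """
    match = _SILVA.match(record_id)
    if match is None:
        raise ValueError(f"not a SILVA-style ID: {record_id!r}")
    accession, start, stop = match.groups()
    return accession, int(start), int(stop)


if __name__ == "__main__":
    for example in (
        "48700b14c9d20c30b0f575d3e01e0e5c",
        "22383138",
        "JQIO01000430.1097739.1099039",
    ):
        print(example, classify_id(example))
    print(parse_silva_id("JQIO01000430.1097739.1099039"))
```

Under these assumptions, the SILVA example splits into the accession `JQIO01000430` with start `1097739` and stop `1099039`. The ASV, MD5, and 8-digit forms carry no positional information in the ID itself.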

With regard to coordinates, it depends on the exact data being considered. The ASVs, for example, are all expected to be 16S V4, with many (hopefully all!) starting at 515F, though this depends on depositors having provided accurate information for the included samples. For the operons, I likely have the coordinates but would need to track them down -- I can do so if helpful. For LTP and GTDB, we use what is provided as is. For WoL, I think coordinates are provided in that resource, but again I may be able to track them down if needed.

I think that covers it. Please let me know if there are further questions!