Improve the OTU map structure

wasade commented 11 years ago

Given that we have begun encountering pathological cases in which we cannot parse OTU maps, I think it's time we revisit them. How does the following structure sound?

observationID TAB sampleID:sequence_hash:count TAB ...

This is more concise, describes the sequence, and does not deviate heavily from the existing structure. It would necessitate a change in split libraries to encode the sequence IDs differently, but the change is simple:

>sampleID:sequence_hash

Or, taking this a step further, split libraries could dereplicate, leading to demultiplexed data that look like:

>sampleID:sequence_hash:count
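As a rough illustration of the dereplicated header above, here is a minimal sketch that collapses identical sequences within each sample and emits `>sampleID:sequence_hash:count` labels. The function name `dereplicate` and the use of MD5 are assumptions for illustration only; the thread does not specify a hash function, and this is not how split libraries is implemented. It also holds every unique sequence in memory, which would not scale to a 20GB file without more work.

```python
import hashlib
from collections import Counter


def dereplicate(records):
    """Collapse (sample_id, sequence) pairs into one record per unique pair.

    Yields (header, sequence) tuples where the header has the form
    '>sampleID:sequence_hash:count'. MD5 is an assumed choice of hash.
    """
    counts = Counter()  # (sample_id, seq_hash) -> number of occurrences
    seqs = {}           # seq_hash -> sequence (all unique sequences in memory)
    for sample_id, seq in records:
        seq_hash = hashlib.md5(seq.encode("ascii")).hexdigest()
        counts[(sample_id, seq_hash)] += 1
        seqs[seq_hash] = seq
    for (sample_id, seq_hash), count in counts.items():
        header = ">%s:%s:%d" % (sample_id, seq_hash, count)
        yield header, seqs[seq_hash]


if __name__ == "__main__":
    demo = [("S1", "ACGT"), ("S1", "ACGT"), ("S2", "ACGT"), ("S1", "GGCC")]
    for header, seq in dereplicate(demo):
        print(header)
        print(seq)
```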

The motivation here is the frustration with representing OTU maps in QiiTa such that we can a) have sequence-level resolution and b) maintain as small a footprint as possible. This would also have the benefit of reducing compute across the board since, for instance, pick OTUs would operate on non-redundant data (though I realize dereplicating a 20GB fasta file for this procedure would not be trivial). This also (temporarily) gets us around the parsing issue for large OTU maps. Solving the latter could probably be done relatively easily with a binary structure, which would be awesome, but I may be asking for too much...
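For concreteness, a line in the proposed `observationID TAB sampleID:sequence_hash:count TAB ...` layout could be parsed roughly as follows. This is only a sketch: `Member` and `parse_otu_map_line` are hypothetical names, and it assumes that the hash and count never contain a colon (the sample ID may).

```python
from collections import namedtuple

# Hypothetical record type for one sampleID:sequence_hash:count field.
Member = namedtuple("Member", ["sample_id", "sequence_hash", "count"])


def parse_otu_map_line(line):
    """Parse one line of the proposed OTU map format.

    Returns (observation_id, [Member, ...]).
    """
    fields = line.rstrip("\n").split("\t")
    observation_id = fields[0]
    members = []
    for field in fields[1:]:
        # Split from the right so a ':' inside the sample ID survives.
        sample_id, seq_hash, count = field.rsplit(":", 2)
        members.append(Member(sample_id, seq_hash, int(count)))
    return observation_id, members


if __name__ == "__main__":
    print(parse_otu_map_line("otu1\tS1:abc:3\tS2:def:1\n"))
```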

gregcaporaso commented 11 years ago

I like this idea, with a few caveats:

1) I don't think we'd want to implement the pre-otu-picking (i.e., modified split_libraries format) idea. It's good to keep that file format really simple, as it allows people to very easily get data into QIIME if we don't support the sequencing platform that their data is coming from. It would also require changes to all of the OTU pickers (see my comment 3).
2) Note that the uclust (and possibly usearch) OTU picker already does operate on non-redundant data (the sequences get collapsed/expanded in pick_otus.py). We'd have to be careful about putting that strategy in place generally, as some OTU pickers (UPARSE, I think) dereplicate internally and use the abundance information.
3) I don't think putting this change in place would be trivial, as we'd need to modify all of the OTU pickers to support this. While it could be clunky, it might be worth supporting both the current OTU map format and the new one that you're proposing, and having the parser determine which format it got before parsing. We could then put this in place for the commonly used OTU pickers, but not bother with the ones that we don't use much (mothur, cdhit, prefix/suffix, trie, BLAST, ...). I do think we could do this pretty easily though (e.g., the first line in the new format has a format identifier).
4) Should pick_otus.py just output a biom file with no sample or observation metadata? Hmm... why not?
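The format-detection idea (a parser that checks the first line for a format identifier and otherwise falls back to the current layout) might look something like the sketch below. The identifier string `#otu-map-v2` and the function name `parse_otu_map` are made up for illustration; nothing in this thread fixes what the identifier would actually be.

```python
import itertools

# Hypothetical format identifier; the real one is not decided in this thread.
NEW_FORMAT_HEADER = "#otu-map-v2"


def parse_otu_map(lines):
    """Parse an OTU map in either the current or the proposed format.

    Returns (format_name, {otu_id: members}). For the current format the
    members are sequence ID strings; for the proposed format they are
    (sample_id, sequence_hash, count) tuples.
    """
    lines = iter(lines)
    first = next(lines, None)
    if first is None:
        return "legacy", {}
    if first.strip() == NEW_FORMAT_HEADER:
        fmt = "new"
    else:
        fmt = "legacy"
        # The first line was data, so put it back in front of the rest.
        lines = itertools.chain([first], lines)

    otu_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        if fmt == "new":
            members = []
            for field in fields[1:]:
                sample_id, seq_hash, count = field.rsplit(":", 2)
                members.append((sample_id, seq_hash, int(count)))
        else:
            members = fields[1:]
        otu_map[fields[0]] = members
    return fmt, otu_map
```

Because the function only needs an iterable of lines, an open file handle can be passed directly, and only the parsed map is held in memory rather than the raw file.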

gregcaporaso commented 11 years ago

Should pick_otus.py just output a biom file with no sample or observation metadata? Hmm... why not?

Nevermind - because then we don't know which sequences are in which OTU.

rob-knight commented 11 years ago

That would be a bit of a drawback...

On Oct 29, 2013, at 9:49 PM, "Greg Caporaso" notifications@github.com wrote:

| Should pick_otus.py just output a biom file with no sample or observation metadata? Hmm... why not?

Nevermind - because then we don't know which sequences are in which OTU.


wasade commented 11 years ago

1) Can we at least do "SampleID_sequencehash_number" instead of "SampleID_number"? The former will make things like dereplication trivial downstream and greatly simplifies associating dereplicated sequences with what's in the OTU maps. This should not break anything, though it will slightly exacerbate #594.

2) Doing 1) wouldn't affect this

3) Doing 1) wouldn't affect this

4) No... but you came to that conclusion as well.

The downside here being that we're delaying the problem: we need to redo the actual OTU map format. We're at the breaking point for tab delimited.

The specific issue I'm looking to address is, given an OTU map for a large study (e.g., global gut), what are the sequence hashes associated with the sequence IDs in the OTU map? It is not feasible to have the sequences in memory, it is very expensive to issue repeated queries to a cache in an RDBMS, and it is not feasible to house a dict of {seq_id:hash} in memory in Python for large studies. The current solution would be to stage, from an RDBMS, the hashes in a NoSQL in-memory database using nested hash tables, but that is a large amount of compute for something that could be essentially free if the hash is computed in split libraries.

gregcaporaso commented 11 years ago

Note that the current format for these identifiers is sampleID_someUniqueIdentifier, where someUniqueIdentifier doesn't have to be a number. I think that QIIME relies on there only being one _ in the sequence identifiers, though I'm not sure what would break if we changed that (though it's easy enough to find out if we need to).

What if someUniqueIdentifier was sequenceHash.integer? We could compute this in split_libraries*py and nothing downstream would break (as you point out).
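To make the sampleID_sequenceHash.integer idea concrete, here is a small sketch of building and splitting such an identifier. The helper names and the choice of MD5 are assumptions, not anything that exists in split_libraries; the sketch also assumes the sample ID itself contains no underscore, in line with the single-underscore convention mentioned above.

```python
import hashlib


def make_seq_id(sample_id, sequence, index):
    """Build 'sampleID_sequenceHash.integer' (hypothetical helper, MD5 assumed)."""
    seq_hash = hashlib.md5(sequence.encode("ascii")).hexdigest()
    return "%s_%s.%d" % (sample_id, seq_hash, index)


def split_seq_id(seq_id):
    """Split an identifier back into (sample_id, seq_hash, index).

    The sample ID is everything before the first '_'; the integer is
    everything after the last '.', so periods in the sample ID are fine.
    """
    sample_id, unique = seq_id.split("_", 1)
    seq_hash, index = unique.rsplit(".", 1)
    return sample_id, seq_hash, int(index)


if __name__ == "__main__":
    seq_id = make_seq_id("Sample.1", "ACGT", 7)
    print(seq_id)
    print(split_seq_id(seq_id))
```

Existing code that only takes the part before the underscore as the sample ID keeps working, while code that wants the hash can read it straight from the identifier without a separate lookup.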

wasade commented 11 years ago

That would work. It adds a small amount of complexity when splitting the ID, but it avoids the risk of breaking things.

On Wed, Oct 30, 2013 at 8:30 AM, Greg Caporaso notifications@github.com wrote:

Note that the current format for these identifiers is sampleID_someUniqueIdentifier, where someUniqueIdentifier doesn't have to be a number. I think that QIIME relies on there only being one _ in the sequence identifiers, though I'm not sure what would break if we changed that (though it's easy enough to find out if we need to).

What if someUniqueIdentifier was sequenceHash.integer. We could compute this in split_libraries*py and nothing downstream would break (as you point out).


josenavas commented 11 years ago

What about doing something like biom? We could try to come up with a standard format (maybe binary) that OTU pickers can use by default. There are a lot of OTU pickers out there and new ones are coming. Some new pickers are adopting the Uclust output format, since they know that Uclust is widely used. If the format is standard, new OTU pickers can use it as their default output, and developers of older OTU pickers may consider modifying their pickers to support it.

I see that this will incur a reasonable amount of refactoring on the QIIME side, but we can try to add it to the QIIME 2.0 milestone and make QIIME 2.0 much more standard and stable. Something similar could happen with the mapping files in QIIME 2.0... I would be happy to work on this starting in January, when I'm done with classes.

What do you think?

gregcaporaso commented 11 years ago

I like the idea of a standardized format in QIIME 2.0. @josenavas, want to add an issue for that and assign it to the 2.0 milestone? I think we should do that, and when I'm in Boulder in December have a 2.0 planning meeting where we go over the proposed features and decide on what we're actually going to do.

@wasade, if you think we're settled on the updated identifiers, I can hook this up in split_libraries_fastq.py if someone else can do it for split_libraries.py.

josenavas commented 11 years ago

I have created a new issue: #1169. I don't have permission to add it to the QIIME 2.0 milestone, though...

biocore / qiime

Improve the OTU map structure #1163