Open wasade opened 11 years ago
I like this idea, with a few caveats:
Should pick_otus.py just output a biom file with no sample or observation metadata? Hmm... why not?
Nevermind - because then we don't know which sequences are in which OTU.
That would be a bit of a drawback...
On Oct 29, 2013, at 9:49 PM, "Greg Caporaso" <notifications@github.com> wrote:
| Should pick_otus.py just output a biom file with no sample or observation metadata? Hmm... why not?
Nevermind - because then we don't know which sequences are in which OTU.
Reply to this email directly or view it on GitHub: https://github.com/qiime/qiime/issues/1163#issuecomment-27359413
1) Can we at least do "SampleID_sequencehash_number" instead of "SampleID_number"? The former makes things like dereplication trivial downstream and greatly simplifies associating dereplicated sequences with what's in the OTU maps. This should not break anything, though it will slightly exacerbate #594.
2) Doing 1) wouldn't affect this
3) Doing 1) wouldn't affect this
4) No... but you came to that conclusion as well.
The downside here being that we're delaying the problem: we need to redo the actual OTU map format. We're at the breaking point for tab delimited.
The specific issue I'm looking to address is: given an OTU map for a large study (e.g., global gut), what are the sequence hashes associated with the sequence IDs in the OTU map? It is not feasible to have the sequences in memory, it is very expensive to issue repeated queries to a cache in an RDBMS, and it is not feasible to hold a dict of {seq_id:hash} in memory in Python for large studies. The current solution would be to stage the hashes from an RDBMS into a NoSQL in-memory database using nested hash tables, but that is a large amount of compute for something that could be essentially free if the hash were computed in split libraries.
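To illustrate why carrying the hash in the identifier would make this lookup essentially free, here is a minimal sketch. It assumes the "SampleID_sequencehash_number" layout from point 1, a tab-delimited OTU map (OTU ID followed by sequence IDs), and that neither the sample ID nor the hash contains an underscore. The function name and the example data are hypothetical.

```python
from collections import defaultdict


def hashes_per_otu(otu_map_lines):
    """Yield (otu_id, {hash: count}) one OTU at a time.

    Assumes IDs look like SampleID_sequencehash_number and that neither
    the sample ID nor the hash contains an underscore (sketch assumption).
    """
    for line in otu_map_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        otu_id, seq_ids = fields[0], fields[1:]
        counts = defaultdict(int)
        for seq_id in seq_ids:
            # The hash is read straight from the ID: no database or
            # in-memory {seq_id: hash} lookup is needed.
            _sample, seq_hash, _number = seq_id.split("_")
            counts[seq_hash] += 1
        yield otu_id, dict(counts)


# Made-up OTU map lines with shortened hashes, for illustration only.
otu_map = [
    "OTU0\tS1_ab12_0\tS2_ab12_7\tS1_cd34_1\n",
    "OTU1\tS2_ef56_3\n",
]
result = list(hashes_per_otu(otu_map))
```

Because the generator handles one line at a time, memory use is bounded by the largest single OTU rather than by the whole study.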
Note that the current format for these identifiers is sampleID_someUniqueIdentifier, where someUniqueIdentifier doesn't have to be a number. I think that QIIME relies on there only being one _ in the sequence identifiers, though I'm not sure what would break if we changed that (though it's easy enough to find out if we need to).
What if someUniqueIdentifier was sequenceHash.integer? We could compute this in split_libraries*py and nothing downstream would break (as you point out).
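A minimal sketch of what building and splitting such an identifier could look like. MD5 is an arbitrary choice here since the thread does not name a hash function, and both function names are hypothetical. It assumes the sample ID contains no underscore and the hash contains no period.

```python
import hashlib


def make_seq_id(sample_id, sequence, index):
    """Build sampleID_sequenceHash.integer (hash function is an assumption)."""
    seq_hash = hashlib.md5(sequence.encode("ascii")).hexdigest()
    return f"{sample_id}_{seq_hash}.{index}"


def parse_seq_id(seq_id):
    """Split an ID back into (sample_id, seq_hash, index)."""
    # Only one underscore is present, so existing code that splits on "_"
    # still sees sampleID_someUniqueIdentifier.
    sample_id, unique = seq_id.split("_", 1)
    # Sample IDs may contain periods, so split the unique part from the right.
    seq_hash, index = unique.rsplit(".", 1)
    return sample_id, seq_hash, int(index)


seq_id = make_seq_id("PC.354", "ACGT", 0)
parsed = parse_seq_id(seq_id)
```

The extra step compared to today is the `rsplit(".", 1)` on the unique part; everything that only needs the sample ID is unaffected.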
That would work. It adds a small amount of complexity when splitting the ID, but it avoids the risk of breaking things.
On Wed, Oct 30, 2013 at 8:30 AM, Greg Caporaso <notifications@github.com> wrote:
Note that the current format for these identifiers is sampleID_someUniqueIdentifier, where someUniqueIdentifier doesn't have to be a number. I think that QIIME relies on there only being one _ in the sequence identifiers, though I'm not sure what would break if we changed that (though it's easy enough to find out if we need to).
What if someUniqueIdentifier was sequenceHash.integer. We could compute this in split_libraries*py and nothing downstream would break (as you point out).
What about doing something like biom? We could try to come up with a standard format (maybe binary) that OTU pickers can default to. There are a lot of OTU pickers out there and new ones are coming. Some new pickers are adopting the Uclust output format, since they know that Uclust is widely used. If the format is standard, new OTU pickers can use it as their default output, and developers of older OTU pickers may consider modifying their pickers to support it.
I see that this will incur a reasonable amount of refactoring on the QIIME side, but we can try to add it to the QIIME 2.0 milestone and make QIIME 2.0 much more standard and stable. Something similar could happen with the mapping files in QIIME 2.0... I would be happy to work on this starting in January, when I'm done with classes.
What do you think?
I like the idea of a standardized format in QIIME 2.0. @josenavas, want to add an issue for that and assign it to the 2.0 milestone? I think we should do that, and when I'm in Boulder in December have a 2.0 planning meeting where we go over the proposed features and decide on what we're actually going to do.
@wasade, if you think we're settled on the updated identifiers, I can hook this up in split_libraries_fastq.py if someone else can do it for split_libraries.py.
I have created a new issue: #1169. I don't have permission to add it to the QIIME 2.0 milestone, though...
Given that we have begun encountering pathological cases in which we cannot parse OTU maps, I think it's time we revisit them. How does the following structure sound?
This is more concise and describes the sequence and does not deviate heavily from the existing structure. This would necessitate a change in split libraries to encode the sequence IDs differently, but the change is simple:
Or, taking this a step further, split libraries could dereplicate, leading to demultiplexed data that look like:
The motivation here is the frustration with representing OTU maps in QiiTa such that we can a) have sequence-level resolution and b) maintain as small a footprint as possible. This would also reduce compute across the board since, for instance, pick OTUs would operate on non-redundant data (though I realize dereplicating a 20GB fasta file for this procedure would not be trivial). This also (temporarily) gets us around the parsing issue for large OTU maps. Solving the latter could probably be done relatively easily with a binary structure, which would be awesome, but I may be asking for too much...
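For reference, a minimal sketch of what dereplication means here, assuming it amounts to collapsing identical sequences and keeping per-sample counts. The function name and return layout are hypothetical and are not the structure proposed above.

```python
from collections import Counter


def dereplicate(records):
    """Collapse identical sequences and count occurrences per sample.

    records: iterable of (sample_id, sequence) pairs.
    Returns {sequence: Counter({sample_id: count})}.
    """
    unique = {}
    for sample_id, sequence in records:
        unique.setdefault(sequence, Counter())[sample_id] += 1
    return unique


# Made-up reads for illustration only.
reads = [("S1", "ACGT"), ("S2", "ACGT"), ("S1", "ACGT"), ("S2", "TTGA")]
derep = dereplicate(reads)
```

This version keeps every unique sequence in memory, so it would not scale to a 20GB fasta file as written; keying on a hash of the sequence or sorting on disk would be the obvious next step.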