Closed. IvoryC closed this issue 3 years ago.
We already have several lines that (according to the logs) are omitted, for example:
2020-11-11 15:32:51 DEBUG RdpNode: Omit incomplete [ first_1A ] OTU missing the top taxonomy level: phylum, classifier output = FCA5PCT:1:2102:24267:10048#/1 Bacteria domain 1.0 Firmicutes phylum 0.79 Bacilli class 0.7 Lactobacillales order 0.67 Enterococcaceae family 0.31 Melissococcus genus 0.28
I haven't figured out why some lines trigger that message (nothing looks like it's actually missing?)... that's a side quest. But it looks like the approach we took to a similar problem in the past was to print a message (which I think I will upgrade to a warning) and omit the line.
Do we expect each line of RDP output to have the same number of elements? I could add a rule that notes the number of tokens in the first line, and tosses out any later line with fewer tokens than that. If it finds a line with more than that (the first line might be truncated), then it should throw an error. That would be quick and easy to code, but I don't know if that's actually a rule of RDP output.
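As a rough sketch of that rule (the file names here are made up, and the two printf lines just build a toy input so the example is self-contained):

```shell
# Build a tiny two-line stand-in for an RDP fixrank output file.
printf 'seq1\t-\tBacteria\tdomain\t1.0\n' >  rdp_out.txt
printf 'seq2\t-\tBacteria\n'              >> rdp_out.txt

# Take the first line's token count as the expected count.
# Fewer tokens: warn and omit the line. More tokens: treat the first
# line as possibly truncated and abort with an error.
awk -F'\t' '
  NR == 1 { n = NF }
  NF < n  { print "WARNING: omitting short line " NR > "/dev/stderr"; next }
  NF > n  { print "ERROR: line " NR " has more tokens than line 1" > "/dev/stderr"; exit 1 }
  { print }
' rdp_out.txt > rdp_clean.txt
```

With the toy input above, the short second line is dropped and only the first line survives into rdp_clean.txt.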
The first element in each row is the sequence id, which could be anything, so it's hard to write a test that checks the first value meets expectations. The second element is always either a "-" or "", but I don't know if that's a rule or just what happens to be the case for all the files I've seen.
If I can find concrete answers about expectations, then coding in a check becomes easy.
(tip from Dr. Fodor)
Is it possibly being rejected because of the "#" or another special character somewhere? How many lines like that are we removing?
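One quick way to answer the "how many" question is to count the "Omit incomplete" messages (that phrase is taken from the log line quoted above) in the pipeline log. The log file name here is hypothetical, and the printf just builds a two-line sample log so the command can be run as-is:

```shell
# Build a tiny sample log (stand-in for the real BioLockJ log).
printf '%s\n' \
  'DEBUG RdpNode: Omit incomplete [ first_1A ] OTU missing the top taxonomy level: phylum' \
  'DEBUG RdpNode: parsed line ok' \
  > sample.log

# Count how many lines the parser reported omitting.
omitted=$(grep -c 'Omit incomplete' sample.log)
echo "omitted lines: $omitted"
```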
I don't think every line in RDP is guaranteed to have the same number of tokens. It can sometimes have taxa like sub-class or something like that.
The call we make to the RDP classifier is pretty minimalist. We specify the output file and the parameter -f fixrank.
There are more options we could use:
```
$ java -jar $RDP classify
Command Error: Require the output file for classification assignment
usage: [options] <samplefile>[,idmappingfile] ...
 -b,--bootstrap_outfile <arg>   the output file containing the number of
                                matching assignments out of 100 bootstraps for
                                major ranks. Default is null
 -c,--conf <arg>                assignment confidence cutoff used to determine
                                the assignment count for each taxon. Range
                                [0-1], Default is 0.8.
 -d,--metadata <arg>            the tab delimited metadata file for the samples,
                                with first row containing attribute name and
                                first column containing the sample name
 -f,--format <arg>              tab-delimited output format:
                                [allrank|fixrank|biom|filterbyconf|db]. Default
                                is allRank.
                                allrank: outputs the results for all ranks
                                applied for each sequence: seqname, orientation,
                                taxon name, rank, conf, ...
                                fixrank: only outputs the results for fixed
                                ranks in order: domain, phylum, class, order,
                                family, genus
                                biom: outputs rich dense biom format if OTU or
                                metadata provided
                                filterbyconf: only outputs the results for major
                                ranks as in fixrank, results below the
                                confidence cutoff were bin to a higher rank
                                unclassified_node
                                db: outputs the seqname, trainset_no, tax_id,
                                conf.
 -g,--gene <arg>                16srrna, fungallsu, fungalits_warcup,
                                fungalits_unite. Default is 16srrna. This option
                                can be overwritten by -t option
 -h,--hier_outfile <arg>        tab-delimited output file containing the
                                assignment count for each taxon in the
                                hierarchical format. Default is null.
 -m,--biomFile <arg>            the input clluster biom file. The classification
                                result will replace the taxonomy of the
                                corresponding cluster id.
 -o,--outputFile <arg>          tab-delimited text output file for
                                classification assignment.
 -q,--queryFile                 legacy option, no longer needed
 -s,--shortseq_outfile <arg>    the output file containing the sequence names
                                that are too short to be classified
 -t,--train_propfile <arg>      property file containing the mapping of the
                                training files if not using the default. Note:
                                the training files and the property file should
                                be in the same directory.
 -w,--minWords <arg>            minimum number of words for each bootstrap
                                trial. Default(maximum) is 1/8 of the words of
                                each sequence. Minimum is 5
```
I think the fixrank option is supposed to guarantee that each line has the same number of tokens... but I'm not totally sure.
Better yet! The --hier_outfile option creates a file that looks like it accomplishes everything we aimed to accomplish in the parser. So rather than coding to the specific (and not totally clear) format of the fixrank output file, I think it would be better to reformat the hier-outfile so it can be passed to the BuildTaxaTables module... or just build taxa tables from it directly.
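A reformatting step along those lines might look like the sketch below. The column layout is an assumption (taxid, lineage, name, rank, then one count column per sample); it should be verified against the hier_outfile your RDP version actually writes, and the printf lines just fabricate a toy input so the example is self-contained:

```shell
# Fabricate a tiny hier_outfile stand-in.
# ASSUMED columns: taxid, lineage, name, rank, then one count column per sample.
printf 'taxid\tlineage\tname\trank\tsampleA\n'       >  hier_outfile.txt
printf '1\tRoot;Bacteria\tBacteria\tdomain\t10\n'    >> hier_outfile.txt
printf '2\tRoot;Bacteria;Firmicutes\tFirmicutes\tphylum\t8\n' >> hier_outfile.txt

# Keep the header plus only the rows at one taxonomic rank,
# giving a simple per-sample count table for that rank.
awk -F'\t' 'NR == 1 || $4 == "phylum"' hier_outfile.txt > phylum_table.txt
```

Here RDP has already done the tallying, so there are no "unclassified" judgment calls left for the parser to make.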
It looks like the RDP module adds the RdpParser as a post-requisite.
And the user typically adds BuildTaxaTables. All downstream modules use the taxa tables; I don't think anything downstream uses the OTU tables made by the parser.
I could make an alternative parser, the RdpHierParser, that uses the hier-output to build a taxa table.
A config property, rdp.hierarchyCounts=Y, would cause the rdp module to add the --hier_outfile option, and to add the RdpHierParser instead of the RdpParser. This would side-step all the assumptions that the RdpParser makes about how to handle "unclassified" output, in favor of letting RDP determine the tallies at each level.
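In a pipeline properties file, the proposed switch would be a one-line addition (the property name is the one proposed in this comment; whether it ships under exactly this name should be checked against the release notes):

```
# hypothetical BioLockJ config fragment:
# use RDP's hier_outfile and the RdpHierParser instead of the default RdpParser
rdp.hierarchyCounts=Y
```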
Ke ran her pipeline again, and the problem line (shown above) was not there. However, the parser still failed, possibly because the same issue occurred in a different file. This sounds like it is really an RDP problem; I'm not sure what is causing these random errors. I'm also not totally sure a work-around is a good thing. Maybe?
The RdpHierParser exists. It will be part of release v1.3.14. It may be a helpful tool for dealing with this RDP problem, or for users who don't want the "unclassified" groups to be formed.
The default parser, RdpParser, also got some helpful debug statements, so it will be easier to diagnose this problem in the future.
These changes don't fix the mystery RDP problem, but they cover as much as BioLockJ should do.
Ke encountered this error from the RdpParser in the all_China dataset/pipeline.
I've seen this error before... I think it was in one of the test datasets, but some minor change must have worked around it, so I never came back to solve it.
Using Ke's pipeline files as an example ... I think I found the line in the data file that is causing the problem:
See that middle line? Most of these lines have the same format, but that one looks like it's only the last half of a line. The parser gets to that line and goes "whoa now! that's not ok!"
So now we have two new questions: