BioLockJ-Dev-Team / sheepdog_testing_suite

Test suite for BioLockJ development team.
3 stars 8 forks source link

fix mystery failures in Rdp Parser #293

Closed IvoryC closed 3 years ago

IvoryC commented 3 years ago

Ke encountered this error from the RdpParser in the all_China dataset/pipeline.

java.lang.Exception: Invalid RDP confidence score | Required range [ 0.0 <= score <= 1.0 ] ---> Actual score = genus
    at biolockj.node.r16s.RdpNode.calculateScore(RdpNode.java:107)
    at biolockj.node.r16s.RdpNode.buildRdpNode(RdpNode.java:85)
    at biolockj.node.r16s.RdpNode.<init>(RdpNode.java:38)
    at biolockj.module.implicit.parser.r16s.RdpParser.parseSamples(RdpParser.java:60)
    at biolockj.module.implicit.parser.ParserModuleImpl.runModule(ParserModuleImpl.java:114)
    at biolockj.module.JavaModuleImpl.executeTask(JavaModuleImpl.java:55)
    at biolockj.Pipeline.executeModule(Pipeline.java:46)
    at biolockj.Pipeline.executeModules(Pipeline.java:196)
    at biolockj.Pipeline.runPipeline(Pipeline.java:140)
    at biolockj.BioLockJ.runPipeline(BioLockJ.java:344)
    at biolockj.BioLockJ.main(BioLockJ.java:102)

I've seen this error before... I think it was in one of the test datasets; but some minor change must have worked around it so I never came back to solve it.

Using Ke's pipeline files as an example ... I think I found the line in the data file that is causing the problem:

FCA5PCT:1:1104:21320:3969#/1        Bacteria    domain  1.0 Firmicutes  phylum  0.98    Bacilli class   0.85    Lactobacillales order   0.82    Enterococcaceae family  0.19    Melissococcus   genus   0.15
FCA5PCT:1:2102:24267:10048#/1       Bacteria    domain  1.0 Firmicutes  phylum  0.79    Bacilli class   0.7 Lactobacillales order   0.67    Enterococcaceae family  0.31    Melissococcus   genus   0.28
family  0.37    Faecalibacterium    genus   0.33
FCA5PCT:1:1104:9726:7550#/2 -   Bacteria    domain  1.0 Firmicutes  phylum  0.68    Negativicutes   class   0.36    Selenomonadales order   0.36    Veillonellaceae family  0.36    Megamonas   genus   0.15
FCA5PCT:1:1112:10700:14901#/2   -   Bacteria    domain  1.0 Firmicutes  phylum  1.0 Clostridia  class   1.0 Clostridiales   order   1.0 Ruminococcaceae family  1.0 Ruminococcus    genus   0.84

See that middle line? most of these lines have the same format, but that one looks like its only the last half of a line. The parser gets to that line and goes "whoa now! that's not ok!"

So now we have two new questions:

  1. Why did the classifier make a broken line like this? Can we prevent that from happening ?
  2. This was one line out of tens of thousands. Lets just choose to forgive the classifier. How can we make the parser resilient to broken line like this ?
IvoryC commented 3 years ago

We already have several lines that (according to the logs) are omitted because

2020-11-11 15:32:51 DEBUG RdpNode: Omit incomplete [ first_1A ] OTU missing the top taxonomy level: phylum, classifier output = FCA5PCT:1:2102:24267:10048#/1 Bacteria domain 1.0 Firmicutes phylum 0.79 Bacilli class 0.7 Lactobacillales order 0.67 Enterococcaceae family 0.31 Melissococcus genus 0.28

I haven't figured out why some lines give that message (it doesn't look like it's missing anything (?) )... that's a side quest. But it looks like the approach we took in the past to some problem was to print a message (which I think I will upgrade to a warning) and omit the line.

Do we expect that each line of RDP output has the same number of elements ? I could add a rule that notes the number of tokens in the first line, and if it finds a line with fewer than that, it tosses out the shorter line. I guess if it finds a line with more than that (the first line might be truncated) then it should throw an error. That would be quick and easy to code, but I don't know if that's actually a rule of RDP output.

The first element in each row is the sequence id, which could be anything, so its hard to make a test to make sure the first value meets expectations. The second element is always either a "-" or "", but I don't know if that's a rule or just what happens to be the case for all the files I've seen.

If I can find concrete answers about expectations, then coding in a check becomes easy.

IvoryC commented 3 years ago

(tip from Dr. Fodor)

It is possibly being rejected because of the "#" or another special character somewhere? How many lines like that are we removing?

I don't think every line in RDP is guaranteed to have the same number of tokens. It can sometimes have taxa like sub-class or something like that.

IvoryC commented 3 years ago

The call we make to the RDP classifier is pretty minimalist. We specify the output file and the parameter -f fixrank. There are more options we could use:

$ java -jar $RDP classify
Command Error: Require the output file for classification assignment
usage:  [options] <samplefile>[,idmappingfile] ...
 -b,--bootstrap_outfile <arg>   the output file containing the number of
                                matching assignments out of 100 bootstraps for
                                major ranks. Default is null
 -c,--conf <arg>                assignment confidence cutoff used to determine
                                the assignment count for each taxon. Range
                                [0-1], Default is 0.8.
 -d,--metadata <arg>            the tab delimited metadata file for the samples,
                                with first row containing attribute name and
                                first column containing the sample name
 -f,--format <arg>              tab-delimited output format:
                                [allrank|fixrank|biom|filterbyconf|db]. Default
                                is allRank.
                                allrank: outputs the results for all ranks
                                applied for each sequence: seqname, orientation,
                                taxon name, rank, conf, ...
                                fixrank: only outputs the results for fixed
                                ranks in order: domain, phylum, class, order,
                                family, genus
                                biom: outputs rich dense biom format if OTU or
                                metadata provided
                                filterbyconf: only outputs the results for major
                                ranks as in fixrank, results below the
                                confidence cutoff were bin to a higher rank
                                unclassified_node
                                db: outputs the seqname, trainset_no, tax_id,
                                conf.
 -g,--gene <arg>                16srrna, fungallsu, fungalits_warcup,
                                fungalits_unite. Default is 16srrna. This option
                                can be overwritten by -t option
 -h,--hier_outfile <arg>        tab-delimited output file containing the
                                assignment count for each taxon in the
                                hierarchical format. Default is null.
 -m,--biomFile <arg>            the input clluster biom file. The classification
                                result will replace the taxonomy of the
                                corresponding cluster id.
 -o,--outputFile <arg>          tab-delimited text output file for
                                classification assignment.
 -q,--queryFile                 legacy option, no longer needed
 -s,--shortseq_outfile <arg>    the output file containing the sequence names
                                that are too short to be classified
 -t,--train_propfile <arg>      property file containing the mapping of the
                                training files if not using the default. Note:
                                the training files and the property file should
                                be in the same directory.
 -w,--minWords <arg>            minimum number of words for each bootstrap
                                trial. Default(maximum) is 1/8 of the words of
                                each sequence. Minimum is 5

I think the fixrank option is supposed to guarantee that each line has the same number of tokens...but I'm not totally sure.

Better yet! The --hier_outfile option creates a file looks like it accomplishes everything that we aimed to accomplish in the parser. So rather than coding to the specific (and not totally clear) format of the fixrank output file, I think it would be better to just reformat the hier-outfile to be able to pass that to the BuildTaxaTables module...or just build taxa tables from it.

IvoryC commented 3 years ago

It looks like the RDP module adds the RdpParser as a post-requisite.
And the user typically adds BuiltTaxaTables. All downstream modules use the taxa tables; I don't think anything comes back for the OTU tables made by the parser.

I could make an alternative parser, the RdpHierParser, that uses the hier-output to build a taxa table. A config property, rdp.hierarchyCounts=Y, would cause the rdp module to add the --hier-outfile option, and to add the RdpHierParser instead of the RdpParser. This would side-step all the assumptions that the RdpParser makes about how to handle "unclassified" output; in favor of letting Rdp determine the tallies at each level.

IvoryC commented 3 years ago

Ke ran her pipeline again, and the problem line (shown above) was not there. However the parser still failed, possibly because the same issue occurred in a different file. This sounds like it is really an RDP problem, not sure what is causing these random errors. Not totally sure if a work-around is a good thing. Maybe?

IvoryC commented 3 years ago

The RdpHierParser exists. It will part of release v1.3.14. It may be helpful tool in dealing with this RDP problem, or if users don't want the "unclassified" groups to be formed.

The default parser, RdpParser also got some helpful debug statements so it will be easier to diagnose this problem in the future.

These changes don't fix the mystery RDP problem, but they cover as much as BioLockJ should do.