ARGA-Genomes / arga-data

Mozilla Public License 2.0

Add NCBI taxonomy to a custom version of the name matching index #9

Open nickdos opened 2 years ago

nickdos commented 2 years ago

Doug provided the following hints:

To augment the ALA taxonomy, the steps are:

1. Get the NCBI taxonomy into shape using the Python code in name-preprocessing.
2. Alter the LTC config, putting NCBI at the correct priority for overrides (you can get pretty specific about this).
3. Run the LTC over everything; the result is a combined taxonomy.
4. Build the index.
5. Stand up an instance of NMS with this index, which should be buildable with a suitably configured Docker script.
6. Import into the BIE of your choice.

Notes: LTC = large taxon collider; NMS = name matching service.

Further comments:

It’s pretty much impossible to have a private, selectable source on namematching-ws. It’s reasonably easy to add a data source to the list of sources that gets ingested by the large taxon collider, which produces the normalised taxonomy. But after that it gets complicated. It’s also very easy to set up a completely different set of sources and produce another normalised taxonomy.

The index for namematching-ws relies on an index produced from the normalised taxonomy. Producing this index is straightforward but the result is unstable. It generates a tree index for accepted concepts and these jump about, even if you feed it the same taxonomy. The tree indices are important for the biocache and they’re one of the things making dynamic updates difficult.
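One plausible reading of why the tree indices "jump about" even on identical input: if the index is nested-set style (left/right intervals assigned by a depth-first walk), the numbers depend entirely on traversal order, so any nondeterminism in sibling ordering renumbers large parts of the tree. A minimal Python sketch of that effect; the function and node names are hypothetical, not the actual index builder's scheme:

```python
# Nested-set ("left/right") tree index: each node gets an interval that
# contains all of its descendants' intervals. The numbers are a pure
# function of traversal order, so two runs that visit siblings in a
# different order produce different indices for the same taxonomy.

def nested_set_index(children, root):
    """Assign (left, right) intervals to each node via depth-first traversal."""
    index, counter = {}, [0]

    def visit(node):
        counter[0] += 1
        left = counter[0]
        for child in children.get(node, []):
            visit(child)
        counter[0] += 1
        index[node] = (left, counter[0])

    visit(root)
    return index

# Same taxonomy, two different sibling orders:
a = nested_set_index({"Life": ["Animalia", "Plantae"]}, "Life")
b = nested_set_index({"Life": ["Plantae", "Animalia"]}, "Life")
print(a["Animalia"], b["Animalia"])  # (2, 3) vs (4, 5): same concept, different index
```

If this is roughly what the builder does, stabilising the indices would require a deterministic sibling sort before numbering.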

The BIE imports the same normalised taxonomy as the index builder. However, it does this independently. It can also ingest other taxonomies, but the results will not be pretty if they overlap.

My guess is that you want a completely separate taxonomy? In which case, the steps are: gently massage the NCBI taxonomy into the correct DwCA shape. If it must be combined with other taxonomies, feed it into the large taxon collider and collect the sausage on the other side. Once you have that, you can build a name matching index that drives the NMS and also feed it into a separate BIE instance.
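The "gentle massage" step amounts to translating NCBI's taxdump format into a Darwin Core taxon core. The taxdump files (nodes.dmp, names.dmp) separate fields with `\t|\t` and terminate lines with `\t|`; the DwC column names below are standard, but the exact set of columns the LTC expects, and all helper names, are assumptions for illustration:

```python
# Hedged sketch: convert NCBI taxdump nodes.dmp/names.dmp rows into
# Darwin Core taxon rows (taxonID, parentNameUsageID, scientificName,
# taxonRank). Sample data inlined; a real run would stream the dump files.
import csv, io

def parse_dmp(text):
    """NCBI dump files use '\t|\t' as the field separator and end lines with '\t|'."""
    for line in text.strip().splitlines():
        yield line.rstrip("\t|").split("\t|\t")

NODES = "1\t|\t1\t|\tno rank\t|\n2\t|\t1\t|\tsuperkingdom\t|"
NAMES = ("1\t|\troot\t|\t\t|\tscientific name\t|\n"
         "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|")

# names.dmp holds several name classes per taxon; keep only the scientific name
names = {tax_id: name for tax_id, name, _, name_class in parse_dmp(NAMES)
         if name_class == "scientific name"}

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["taxonID", "parentNameUsageID",
                                         "scientificName", "taxonRank"],
                        delimiter="\t")
writer.writeheader()
for fields in parse_dmp(NODES):
    tax_id, parent_id, rank = fields[0], fields[1], fields[2]
    writer.writerow({"taxonID": tax_id,
                     # NCBI's root node points at itself; blank the parent for DwC
                     "parentNameUsageID": "" if parent_id == tax_id else parent_id,
                     "scientificName": names[tax_id],
                     "taxonRank": rank})
print(out.getvalue())
```

The self-referential root is worth handling explicitly, since dangling or circular parent links are exactly what trips up downstream tooling.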

nickdos commented 2 years ago

See https://github.com/ARGA-Genomes/arga-data/issues/11#issuecomment-1241400786

For now, I've decided not to follow this path.

nickdos commented 2 years ago

I've made a second attempt at this, using both the full GBIF backbone (#24) and also just the NCBI taxonomy, taken from checklistbank.org. The GBIF backbone failed due to simply being too big to load, as the NMI keeps the entire tree in memory and the GBIF backbone is almost 10x larger. The NCBI taxonomy failed after 12+ hours with a circular loop reference error, meaning two taxa pointed back to each other via parent IDs in some way. Seeing as it took 12+ hours to find the first circular reference, and there could be dozens of these, I'm thinking it's not a good use of my time to run this continually for days. It is possible to exclude names via config, but you have to know what they are beforehand.
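Rather than discovering one cycle per failed multi-hour run, the parent links could be pre-scanned to list every cycle in a single pass before the LTC ever starts. A sketch of that pre-scan (hypothetical helper, not part of the LTC); `parents` maps each taxonID to its parentNameUsageID, and the loop is iterative so a multi-million-node taxonomy doesn't blow the recursion limit:

```python
# Find every cycle in a parent-ID mapping in one pass. Each node is walked
# upward at most once: nodes on the current walk are marked "visiting", and
# anything fully processed is marked "done" so later walks stop early.

def find_cycles(parents):
    state = {}           # node -> "visiting" | "done"
    cycles = []
    for start in parents:
        path, node = [], start
        while node in parents and state.get(node) is None:
            state[node] = "visiting"
            path.append(node)
            node = parents[node]
        # If the walk ran into a node from its own path, that tail is a cycle.
        if state.get(node) == "visiting" and node in path:
            cycles.append(path[path.index(node):])
        for seen in path:
            state[seen] = "done"
    return cycles

# Two taxa pointing back at each other via parent IDs, as in the failed run:
parents = {"A": "B", "B": "A", "C": "A", "D": None}
print(find_cycles(parents))  # [['A', 'B']]
```

The resulting list would feed straight into the exclude-names config, instead of learning one name at a time from failed runs.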

nickdos commented 2 years ago

Planning to re-run the merge of NCBI with the major ALA sources, using nectar-arga-dev-2.ala.org.au, which has 32 GB of memory. Will need to exclude the loop taxon first.

UPDATE 1: Started the run on nectar-arga-dev-2, using the screen tool.

UPDATE 2: Errored with the same message, so the config is not right; checking with Doug on what I did wrong.

nickdos commented 2 years ago

Failed again with error:

```
ERROR: [ScientificName] - Unable to find principal for SN[no code, TROCHIDAE, unranked]
ERROR: [TaxonomyBuilder] - Unable to combine taxa
java.lang.NullPointerException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
        at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
        at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
        at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
        at au.org.ala.names.index.Taxonomy.resolvePrincipal(Taxonomy.java:728)
        at au.org.ala.names.index.Taxonomy.resolve(Taxonomy.java:441)
        at au.org.ala.names.index.TaxonomyBuilder.main(TaxonomyBuilder.java:151)
Caused by: java.lang.NullPointerException
        at au.org.ala.names.index.TaxonConceptInstance.getResolvedAccepted(TaxonConceptInstance.java:1084)
        at au.org.ala.names.index.TaxonConceptInstance.getResolvedAccepted(TaxonConceptInstance.java:1019)
        at au.org.ala.names.index.ScientificName.findPrincipal(ScientificName.java:129)
        at au.org.ala.names.index.ScientificName.findPrincipal(ScientificName.java:53)
        at au.org.ala.names.index.Name.resolvePrincipal(Name.java:251)
        at au.org.ala.names.index.Taxonomy.lambda$resolvePrincipal$42(Taxonomy.java:728)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1652)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
null
```

Checking with Doug on how to avoid this.
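Before committing to another multi-hour run, it may be worth pulling every source record for the name in the error ("TROCHIDAE" here) out of the DwCA and eyeballing its status and accepted/parent links, since the NPE comes from resolving an accepted taxon. A hedged sketch; the columns are standard Darwin Core, but the sample rows are made up for illustration:

```python
# Extract and sanity-check all records for a given scientific name from a
# DwCA taxon file (inlined here as sample data). Flag accepted-name links
# that point at a taxonID which doesn't exist in the file.
import csv, io

dwca = io.StringIO(
    "taxonID\tscientificName\ttaxonomicStatus\tacceptedNameUsageID\ttaxonRank\n"
    "t1\tTROCHIDAE\tsynonym\tt9\tunranked\n"   # t9 does not exist: dangling link
    "t2\tTrochidae\taccepted\t\tfamily\n"
)

rows = list(csv.DictReader(dwca, delimiter="\t"))
ids = {r["taxonID"] for r in rows}
for r in rows:
    if r["scientificName"].lower() == "trochidae":
        acc = r["acceptedNameUsageID"]
        dangling = acc and acc not in ids
        print(r["taxonID"], r["taxonomicStatus"], r["taxonRank"],
              "DANGLING accepted link" if dangling else "ok")
```

An unranked record whose accepted link resolves to nothing would be consistent with "Unable to find principal" followed by the NullPointerException, though that causal reading is a guess until Doug confirms.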