egonw closed 5 years ago
Description
Replacing animal testing in Europe has received a lot of attention (drug discovery, chemical regulation), and in some industries (cosmetics) animal testing is no longer allowed. For both small molecules and nanomaterials, enabling access to existing knowledge minimizes the need for new experiments. This requires new, FAIR approaches for connecting and disseminating toxicology research. To boost Open Science and reduce the dependency on a few publishing channels (viz. journal publications, bespoke APIs), enriching webpages with Bioschemas annotation allows independent content to be discovered. However, unless someone or something searches for such content and makes it available, there is no incentive to add the Bioschemas annotation.
This work package will use the Bioschemas scraper (WP5) to discover annotation in toxicology-related resources for MolecularEntity (small compound toxicants) and ChemicalSubstance (nanomaterials). The scraper will run as a Nextflow workflow on the OpenRiskNet/NanoCommons cloud. As input it will take an initial, curated set of toxicological websites known to have deployed Bioschemas markup, such as the eNanoMapper database, ChEMBL, and the Guide to Pharmacology. From the pages on these sites it will harvest linked web pages (which it will also crawl, up to a predefined depth) and detect Bioschemas annotation for ChemicalSubstance and MolecularEntity. The resulting dataset will be made available as CCZero data.
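The crawl-and-detect step described above can be illustrated with a minimal Python sketch. It is not the WP5 Bioschemas scraper, and several things in it are assumptions: that the annotation is embedded as JSON-LD in `script` elements (markup in other syntaxes such as RDFa or Microdata is not handled), that pages are supplied by a caller-provided `fetch` function, and that the names `extract_annotations`, `crawl`, and `max_depth` are hypothetical.

```python
import json
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# The two Bioschemas types this work package targets.
TARGET_TYPES = {"MolecularEntity", "ChemicalSubstance"}


class _PageParser(HTMLParser):
    """Collects JSON-LD script bodies and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []
        self.links = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and (attrs.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _type_names(node):
    """Return the @type values of a JSON-LD node without prefix or namespace."""
    raw = node.get("@type", [])
    names = raw if isinstance(raw, list) else [raw]
    return {str(n).rsplit("/", 1)[-1].rsplit(":", 1)[-1] for n in names}


def extract_annotations(html):
    """Return (matching JSON-LD nodes, outgoing hrefs) for one page."""
    parser = _PageParser()
    parser.feed(html)
    found = []
    for block in parser.jsonld_blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed markup instead of aborting the crawl
        stack = [data]
        while stack:
            node = stack.pop()
            if isinstance(node, list):
                stack.extend(node)
            elif isinstance(node, dict):
                if _type_names(node) & TARGET_TYPES:
                    found.append(node)
                stack.extend(v for v in node.values() if isinstance(v, (dict, list)))
    return found, parser.links


def crawl(seeds, fetch, max_depth=1):
    """Breadth-first crawl from the seed URLs, following links up to max_depth.

    `fetch(url)` is supplied by the caller and returns HTML text or None.
    """
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    results = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        annotations, links = extract_annotations(html)
        if annotations:
            results[url] = annotations
        if depth < max_depth:
            for href in links:
                target = urljoin(url, href)
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library and lets it be exercised against an in-memory dictionary of pages. A production crawler would additionally need politeness controls (robots.txt, rate limiting) and restrictions on which hosts are followed.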
Subsequently, the scraped data will be made available via public databases. First, the discovered data on nanomaterials will be converted into the eNanoMapper Turtle format and loaded into an eNanoMapper database instance running on the NanoCommons cloud. Second, for the small molecules, the output is primarily aimed at enriching Wikidata, a public database that can hold facts about chemical structures, including their provenance. To demonstrate and disseminate the effort, this WP will work with the “starting” ELIXIR Toxicology Community to use this data, for example in international efforts (e.g. the NORMAN Network) around specific substance classes of interest, such as PFAS and bisphenols.
Evaluation
The outcome will be evaluated primarily on the uptime of the crawler and the amount of discovered data. Existing resources, such as ChEMBL, will be used to benchmark the success by independently querying them for expected outcomes. Particular care will be given to the eNanoMapper instance that will hold the results, as this instance uses Bioschemas annotation itself.
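One possible reading of the benchmark is sketched below. The proposal does not define a metric, so using recall is an assumption, as is the idea that both the independent query and the crawler yield comparable identifiers. The function name `benchmark` and the placeholder identifiers are hypothetical.

```python
def benchmark(expected_ids, discovered_ids):
    """Compare crawler output against identifiers obtained independently.

    expected_ids: identifiers returned by querying the reference resource directly.
    discovered_ids: identifiers the crawler extracted from Bioschemas markup.
    """
    expected = set(expected_ids)
    discovered = set(discovered_ids)
    hits = expected & discovered
    return {
        # Fraction of expected entries that the crawler actually found.
        "recall": len(hits) / len(expected) if expected else 0.0,
        "missing": sorted(expected - discovered),
        "unexpected": sorted(discovered - expected),
    }


# Placeholder identifiers, for illustration only.
report = benchmark(["ID-1", "ID-2", "ID-3", "ID-4"], ["ID-1", "ID-2", "ID-9"])
print(report)
```

The `missing` list points at pages the crawler did not reach or could not parse, while `unexpected` may indicate either newly discovered content or identifier mismatches that need manual inspection.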
Communication and Dissemination
The progress of the SIS will be periodically reported to the ELIXIR Toxicology Community mailing list, the H2020 NanoCommons project, and the Wikidata community. Guidance will be given to the Toxicology Community on:
1. how resources can be annotated with Bioschemas (MolecularEntity and ChemicalSubstance);
2. how results disseminated via the eNanoMapper instance and Wikidata can be used in and by other projects.
Timeline
24 months for setting up a crawler pipeline, aggregating useful resources and extracting information, and disseminating this via an eNanoMapper instance and Wikidata.
Month 1-6: running the crawler on many web pages using a parallel approach
Month 7-12: scaling up the crawler and extracting Bioschemas
Month 7-12: development of a data model for Wikidata for extracted Bioschemas data
Month 13-18: collaborate with ELIXIR Community partners on adding Bioschemas annotation to more resources
Month 13-18: putting the crawler in production and start generating CCZero result data sets
Month 19-24: deposit result data in an eNanoMapper instance and in Wikidata
Submitted, but marked "putative" by ELIXIR until the ELIXIR Toxicology Community is approved.
Coordinated by Alasdair Gray.