TIBHannover / diaspora

Digital Approaches for the Synthesis of Poorly Accessible Biodiversity Information (BacDive & Semantics)

convert BacDive data to RDF #2

Closed javadch closed 1 year ago

javadch commented 3 years ago

To deal with this, we followed an Agile approach: we started with a small prototype and plan to scale it to the entire BacDive database. The following sections explain the steps used for the prototype. We are currently in the process of scaling the approach.

BD_table | Priority (10 is high) | short description | Foreign Keys

    • [ ] cell_morphology | 10 | morphology data like size/shape/gram stain and motility | ID_strains/ID_reference
    • [ ] colony_morphology | 10 | Colony shape and colour, incubation time, hemolysis | ID_strains/ID_reference
    • [ ] culture_pH | 10 | data on pH values | ID_strains/ID_reference
    • [ ] culture_temp | 10 | data on temperature values | ID_strains/ID_reference
    • [ ] halophily | 10 | halophilic data | ID_strains/ID_reference
    • [ ] strains | 10 | central table on strain data including species name, culture collection numbers and type strain status
    • [ ] enzymes | 8 | data on enzyme activity | ID_strains/ID_reference
    • [ ] met_antibiotica | 8 | Antibiotica data | ID_strains/ID_reference
    • [ ] met_production | 8 | Metabolite production data | ID_strains/ID_reference
    • [ ] met_util | 8 | Metabolite utilization data | ID_strains/ID_reference
    • [ ] origin | 8 | data on the origin and enrichment of a culture, sample type is the basis for the Isolation Source TAGS | ID_strains/ID_reference
    • [ ] oxygen_tolerance | 8 | data on oxygen relation | ID_strains/ID_reference
    • [ ] reference | 8 | Metadata for the references
    • [ ] risk_assessment | 8 | data on pathogenicity and risk assessment | ID_strains/ID_reference
    • [ ] spore_formation | 8 | data on spore formation ability of the bacteria | ID_strains/ID_reference
    • [ ] culture_medium | 7 | Medium data for cultivation, not standardized data | ID_strains/ID_reference
    • [ ] met_test | 7 | Metabolite test data: methyl red, Voges-Proskauer, Indole and Citrate | ID_strains/ID_reference
    • [ ] nutrition_type | 7 | Nutrition type, rather general data on the nutrition of a bacterium | ID_strains/ID_reference
    • [ ] GC_content | 6 | GC content of the DNA | ID_strains/ID_reference
    • [ ] multicellular_morphology | 6 | data on multicellular complex building ability, not standardized | ID_strains/ID_reference
    • [ ] pigmentation | 6 | data on pigmentation of the bacteria | ID_strains/ID_reference
    • [ ] sequence | 6 | metadata on sequences > might be split into Genome and 16S sequence in the near future | ID_strains/ID_reference
    • [ ] FA_meta | 5 | metadata for fatty acid profiles
    • [ ] FA_profile | 5 | fatty acid profiles, connect metadata with FK_FA_META >PK FA_meta | FK_FA_META/ID_strains/ID_reference
    • [ ] IS_cat1 | 5 | Vocabulary of the Isolation Source TAGS Cat1 (highest)
    • [ ] IS_cat2 | 5 | Vocabulary of the Isolation Source TAGS Cat2 (middle) | FK_Cat1
    • [ ] IS_cat3 | 5 | Vocabulary of the Isolation Source TAGS Cat3 (lowest) | FK_Cat2
    • [ ] IS_link | 5 | Isolation Source Tag data | Cat1_link/Cat2_link/Cat3_link/ID_strains/ID_origin
    • [ ] met_antibiogram | 4 | Antibiogram test data | ID_strains/ID_reference
    • [ ] met_antibiogram_meta | 4 | Metadata for antibiogram tests
    • [ ] murein | 4 | Murein (cell wall) data | ID_strains/ID_reference
    • [ ] tolerance | 4 | Data on tolerances against compounds, non standardized data | ID_strains/ID_reference
    • [ ] strain_history | 4 | data on the history of a strain | ID_strains/ID_reference
    • [ ] compound_production | 3 | not so well structured data on compound production, can be later moved to metabolite and enzyme tables | ID_strains/ID_reference
    • [ ] kit_api_20A | 3 | Test data from API 20A | ID_strains/ID_reference
    • [ ] kit_api_20A_meta | 3 | Metadata for API 20A
    • [ ] kit_api_20E | 3 | Test data from API 20E | ID_strains/ID_reference
    • [ ] kit_api_20E_meta | 3 | Metadata for API 20E
    • [ ] kit_api_20NE | 3 | Test data from API 20NE | ID_strains/ID_reference
    • [ ] kit_api_20NE_meta | 3 | Metadata for API 20NE
    • [ ] kit_api_20STR | 3 | Test data from API 20STR | ID_strains/ID_reference
    • [ ] kit_api_20STR_meta | 3 | Metadata for API 20STR
    • [ ] kit_api_50CHac | 3 | Test data from API 50CHac | ID_strains/ID_reference
    • [ ] kit_api_50CHac_meta | 3 | Metadata for API 50CHac
    • [ ] kit_api_50CHas | 3 | Test data from API 50CHas | ID_strains/ID_reference
    • [ ] kit_api_50CHas_meta | 3 | Metadata for API 50CHas
    • [ ] kit_api_CAM | 3 | Test data from API CAM | ID_strains/ID_reference
    • [ ] kit_api_CAM_meta | 3 | Metadata for API CAM
    • [ ] kit_api_coryne | 3 | Test data from API Coryne | ID_strains/ID_reference
    • [ ] kit_api_coryne_meta | 3 | Metadata for API Coryne
    • [ ] kit_api_ID32E | 3 | Test data from API ID32E | ID_strains/ID_reference
    • [ ] kit_api_ID32E_meta | 3 | Metadata for API ID32E
    • [ ] kit_api_ID32STA | 3 | Test data from API ID32STA | ID_strains/ID_reference
    • [ ] kit_api_ID32STA_meta | 3 | Metadata for API ID32STA
    • [ ] kit_api_LIST | 3 | Test data from API LIST | ID_strains/ID_reference
    • [ ] kit_api_LIST_meta | 3 | Metadata for API LIST
    • [ ] kit_api_NH | 3 | Test data from API NH | ID_strains/ID_reference
    • [ ] kit_api_NH_meta | 3 | Metadata for API NH
    • [ ] kit_api_rID32A | 3 | Test data from API rID32A | ID_strains/ID_reference
    • [ ] kit_api_rID32A_meta | 3 | Metadata for API rID32A
    • [ ] kit_api_rID32STR | 3 | Test data from API rID32STR | ID_strains/ID_reference
    • [ ] kit_api_rID32STR_meta | 3 | Metadata for API rID32STR
    • [ ] kit_api_STA | 3 | Test data from API STA | ID_strains/ID_reference
    • [ ] kit_api_STA_meta | 3 | Metadata for API STA
    • [ ] kit_api_zym | 3 | Test data from API ZYM | ID_strains/ID_reference
    • [ ] kit_api_zym_ec | 3 | Metadata for API ZYM
    • [ ] ncbi_all | 3 | Help table for matching to NCBI | ID_strains
    • [ ] observation | 3 | Unstructured, not standardized data that does not fit into other data fields | ID_strains/ID_reference
    • [ ] biosample | 2 | NCBI Biosample data | ID_strains
    • [ ] countries | 1 | help table for translating ISO/Country
    • [ ] culture_collection | 1 | help table for structuring and analysing culture collection numbers
Gautamshahi commented 3 years ago

Data Analysis and Table Priority

To build the ontology, we started with data analysis, that is, determining which kinds of elements the relational database contains. In total, the database has 96 tables holding 2294578 records. We made a priority list based on how important it is to convert each table to RDF. The SQL dump is available at https://nextcloud.dsmz.de/s/dwgPo2FskjCAMNX?path=%2FBacDive%20Dump

We decided to work on the tables with a priority higher than 0, so in the end we need to convert 71 tables.

The priority list is available at https://github.com/TIBHannover/diaspora/tree/main/wp2/t2.1
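The selection step described above can be sketched in a few lines of Python. This is only an illustration, assuming the priority list is available as text lines in the `name | priority | description | foreign keys` layout used in this issue; the names `TableEntry`, `parse_priority_line`, and `select_tables` are hypothetical and not part of the repository.

```python
from typing import Iterable, List, NamedTuple, Tuple


class TableEntry(NamedTuple):
    name: str
    priority: int
    description: str
    foreign_keys: Tuple[str, ...]


def parse_priority_line(line: str) -> TableEntry:
    """Parse one 'name | priority | description | foreign keys' line."""
    text = line.strip()
    # Drop a leading task-list marker if present.
    for marker in ("• [ ]", "• [x]", "- [ ]", "- [x]"):
        if text.startswith(marker):
            text = text[len(marker):]
            break
    fields = [field.strip() for field in text.split("|")]
    name, priority, description = fields[0], int(fields[1]), fields[2]
    # Foreign keys are optional and separated by '/'.
    raw_fks = fields[3] if len(fields) > 3 else ""
    foreign_keys = tuple(fk for fk in raw_fks.split("/") if fk)
    return TableEntry(name, priority, description, foreign_keys)


def select_tables(lines: Iterable[str], min_priority: int = 1) -> List[TableEntry]:
    """Keep tables with priority >= min_priority, highest priority first."""
    entries = [parse_priority_line(line) for line in lines if line.strip()]
    selected = [entry for entry in entries if entry.priority >= min_priority]
    # sorted() is stable, so the original order is kept within one priority.
    return sorted(selected, key=lambda entry: -entry.priority)
```

Calling `select_tables` on the full list with the default `min_priority=1` corresponds to the "priority higher than 0" rule; raising the threshold would give a smaller first batch for the prototype.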

A glimpse of the schema of the BacDive database: [image: Bacdive_Schema]

Gautamshahi commented 3 years ago

Ontology Development

We designed the ontology using the YAMO methodology, following the MOD guidelines. To make the ontology interoperable, we aligned it with an upper ontology, which clearly identifies the semantics of very common terms that play a major role in the vocabulary of the domain. DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering) is used as the upper ontology for alignment.

The ontology file is available at https://github.com/TIBHannover/diaspora/tree/main/wp2/t2.1

Gautamshahi commented 3 years ago

Mapping Rules

For data mapping, we used the SDM-RDFizer (https://github.com/SDM-TIB/SDM-RDFizer), which applies mapping rules to convert tabular data into RDF triples. Details are given below.

An example set of mapping rules is available at https://github.com/TIBHannover/diaspora/tree/main/wp2/t2.1
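To illustrate what such a mapping does conceptually, here is a minimal plain-Python sketch that turns one table row into N-Triples lines. It does not use the SDM-RDFizer or its rule syntax. The base namespace `https://example.org/bacdive/`, the column names `ID` and `temp`, and the function names are assumptions made for this example; only `ID_strains` and the table names come from the priority list.

```python
from urllib.parse import quote

# Hypothetical namespace, chosen only for this illustration.
BASE = "https://example.org/bacdive/"


def iri(*parts):
    """Build an IRI under BASE from percent-encoded path segments."""
    return "<" + BASE + "/".join(quote(str(p), safe="") for p in parts) + ">"


def literal(value):
    """Render a value as a plain N-Triples string literal."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def row_to_triples(table, row, pk="ID", fk_targets=None):
    """Map one row (a dict) to a list of N-Triples lines.

    fk_targets maps a foreign-key column to the table it points to, so
    that the object becomes an IRI instead of a literal.
    """
    fk_targets = fk_targets or {}
    subject = iri(table, row[pk])
    triples = []
    for column, value in row.items():
        if column == pk or value is None:
            continue  # skip the key itself and NULL cells
        predicate = iri("property", column)
        if column in fk_targets:
            obj = iri(fk_targets[column], value)
        else:
            obj = literal(value)
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


if __name__ == "__main__":
    row = {"ID": 1, "ID_strains": 42, "temp": "37"}
    for line in row_to_triples("culture_temp", row, fk_targets={"ID_strains": "strains"}):
        print(line)
```

Treating foreign-key columns such as `ID_strains` as IRIs rather than literals is what links a measurement row to its strain in the resulting graph; in the actual pipeline this linkage would be expressed in the mapping rules rather than in code.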

Gautamshahi commented 3 years ago

Ontology publication

    • TIB would be ok, but maybe only as a second place, as it has not enough reach so far
    • I would favour BioPortal atm, which is handy and to my perception widely known.
    • Obo foundry is also well-established -> do we have overlap with existing ontologies? I would guess so?
    • OLS at EBI is smaller, still interesting.
