Sage-Bionetworks / sage-monorepo

Where OpenChallenges, Schematic, and other Sage open source apps are built
https://sage-bionetworks.github.io/sage-monorepo/
Apache License 2.0

feat(openchallenges): add EDAM Extract and Transform Processes #2564

Closed mdsage1 closed 3 months ago

mdsage1 commented 3 months ago

Description

EDAM ETL processes need to be developed to incorporate the EDAM ontology into the MariaDB database and link the ontology to existing data. This PR addresses the extract and transform portions.

Related Issue

Contribute to #2524 Contribute to #2548

Fixes #2547 Fixes #2563

Changelog

  1. Download a specified version of the EDAM ontology from https://github.com/Sage-Bionetworks/edamontology
  2. Transform the raw data into a pandas DataFrame that matches the content of this file
  3. Start id values from 1 to mimic the behavior of SQL AUTO_INCREMENT.
  4. Print information and statistics about the data to stdout:
     • Version of EDAM processed
     • Number of concepts transformed (overall, operation, data, etc.)
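The transform and id steps above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the raw column names ("Class ID", "Preferred Label"), the sample rows, and the `transform` helper are assumptions made for the example.

```python
import pandas as pd

# Hypothetical rows standing in for the downloaded EDAM CSV.
# The raw column names are assumptions, not taken from the PR.
raw = pd.DataFrame(
    {
        "Class ID": [
            "http://edamontology.org/data_0005",
            "http://edamontology.org/format_1915",
            "http://edamontology.org/operation_0004",
        ],
        "Preferred Label": ["Resource type", "Format", "Operation"],
    }
)


def transform(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns and add an id column that starts at 1."""
    df = raw_df.rename(
        columns={"Class ID": "class_id", "Preferred Label": "preferred_label"}
    ).reset_index(drop=True)
    # SQL AUTO_INCREMENT starts at 1, whereas a pandas RangeIndex starts at 0.
    df.insert(0, "id", range(1, len(df) + 1))
    return df


concepts = transform(raw)
print(f"Number of concepts transformed: {len(concepts)}")
```

Writing the id as an explicit column, rather than relying on the DataFrame index, keeps the values stable if the frame is later filtered or reordered.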

Preview

image

mdsage1 commented 3 months ago

@tschaffter Quality Gate doesn't seem to be performing checks for this PR.

mdsage1 commented 3 months ago

@tschaffter This is ready for review. Version is now an environment variable and the Description includes a preview.
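One way to read the version from the environment is sketched below. The variable name `EDAM_VERSION`, the fallback value, and the `get_edam_version` helper are all hypothetical; the thread does not say which name the PR uses.

```python
import os


def get_edam_version(default: str = "1.25") -> str:
    """Return the EDAM version to process.

    EDAM_VERSION is a hypothetical variable name and the default is
    only a placeholder for illustration.
    """
    version = os.environ.get("EDAM_VERSION", default).strip()
    if not version:
        raise ValueError("EDAM_VERSION must not be empty")
    return version
```

Failing early on an empty value avoids requesting a malformed download location later in the extract step.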

vpchung commented 3 months ago

@mdsage1 thanks for working on this!! You didn't ask me to, but I added some comments to the PR. Feel free to use them or ignore 😄

sonarcloud[bot] commented 3 months ago

Quality Gate passed for 'openchallenges-edam-etl'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

tschaffter commented 3 months ago

Fixes #2546

This PR is part of #2546, so this PR should not be configured to close this ticket. I will remove it from the list.

mdsage1 commented 3 months ago

closed in error

mdsage1 commented 3 months ago

@mdsage1 ~Why do you show the std, mean and other metrics for the id column?~

Update: here is the information that the script should print:

  • Version of EDAM processed
  • Number of concepts that will be added to the table
    • Total number of concepts
    • Number of concepts in each of the following categories:
      • Data concepts
      • Operation concepts
      • Format concepts
      • Other concepts

@tschaffter Where would I find this data in the EDAM CSV downloaded from GitHub? I do not see a column that indicates a concept name or category, only a preferred label that may include some of the concepts that you've listed.

tschaffter commented 3 months ago

Where would I find this data in the edam csv downloaded from GitHub.

You could get this information from the column class_id and a regex.
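A minimal sketch of this suggestion, assuming each class_id is a URL whose last path segment looks like `<category>_<digits>` (for example `http://edamontology.org/data_0005`). The sample rows and the derived `category` column are illustrative only.

```python
import pandas as pd

# Sample class_id values; the URL shape is an assumption about the CSV.
df = pd.DataFrame(
    {
        "class_id": [
            "http://edamontology.org/data_0005",
            "http://edamontology.org/data_0006",
            "http://edamontology.org/format_1915",
            "http://edamontology.org/operation_0004",
            "http://edamontology.org/topic_0003",
        ]
    }
)

# Capture the word between the last "/" and the "_<digits>" suffix.
df["category"] = df["class_id"].str.extract(r"/([A-Za-z]+)_\d+$", expand=False)

counts = df["category"].value_counts()
print(f"Total number of concepts: {len(df)}")
print(counts)
```

Rows whose class_id does not fit the pattern get a missing category, so they can be counted separately as "other" concepts.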

mdsage1 commented 3 months ago

Where would I find this data in the edam csv downloaded from GitHub.

You could get this information from the column class_id and a regex.

@tschaffter Please see an example of the class_id column below:

image

This column is a URL that differs between entries only in the sequence of numbers at the end.

mdsage1 commented 3 months ago

@tschaffter I have updated the concept counts using the preferred_label column and removed the statistics like mean etc.

tschaffter commented 3 months ago

You could get this information from the column class_id and a regex.

See the suggestion I made above.

vpchung commented 3 months ago

You could get this information from the column class_id and a regex.

@mdsage1 alternatively, you can also do a replace() to remove the substring you don't need from class_id, if you're not comfortable with regex.

EDIT: Since you're interested in the number of concepts per category, you can actually use pandas' contains to get you closer to the count 🙂 e.g.

>>> df["class_id"].str.contains("data")
0        True
1        True
2        True
3        True
4        True
        ...  
3468    False
3469    False
3470    False
3471    False
3472    False

tschaffter commented 3 months ago

Prefer exact match to using contains (more future proof): contains would not work if the ontology were to have the concept Data and DataFormat, for example.
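The difference can be illustrated with a small sketch. The `dataformat` category below is hypothetical and exists only to show the failure mode; the URL shape is likewise an assumption.

```python
import pandas as pd

# "dataformat" is a hypothetical category used to show the pitfall.
class_ids = pd.Series(
    [
        "http://edamontology.org/data_0005",
        "http://edamontology.org/dataformat_0001",
        "http://edamontology.org/format_1915",
    ]
)

# Substring match: also counts the hypothetical "dataformat" row.
substring_count = int(class_ids.str.contains("data", regex=False).sum())

# Exact match: compare the whole category token from the last path segment.
category = class_ids.str.rsplit("/", n=1).str[-1].str.split("_").str[0]
exact_count = int((category == "data").sum())

print(substring_count, exact_count)
```

Here the substring test reports two data concepts while the exact comparison reports one.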

vpchung commented 3 months ago

if the ontology were to have the concept Data and DataFormat, for example.

Good point. Just shooting my shot here, but this can be overcome by using data_ (assuming they use "dataformat_"). Also, you can use regex with contains().

mdsage1 commented 3 months ago

@tschaffter I've updated the concept counts to use the class_id column and a regex. The match ignores case to avoid future issues. I didn't use contains; I used the search() function from the regex module instead. As @vpchung suggested, I added the underscore to the regex so that data, or any other concept name, no longer matches when an additional word follows the name of interest.
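An approach along these lines might look like the sketch below. It is not the code from the PR: the category list, the `count_concepts` helper, and the extra leading "/" anchor are assumptions added for illustration.

```python
import re

# Category names are an assumption based on the counts requested above.
CATEGORIES = ("data", "format", "operation", "topic")


def count_concepts(class_ids):
    """Count class_id values per category, with a fallback 'other' bucket."""
    counts = {name: 0 for name in CATEGORIES}
    counts["other"] = 0
    for class_id in class_ids:
        for name in CATEGORIES:
            # The "/" before and "_" after the name keep "data" from
            # matching a longer name such as a hypothetical "dataformat".
            if re.search(rf"/{name}_\d+$", class_id, flags=re.IGNORECASE):
                counts[name] += 1
                break
        else:
            # No category matched this class_id.
            counts["other"] += 1
    return counts
```

For example, `count_concepts(["http://edamontology.org/Data_0006"])` counts the entry under `data` because the match is case-insensitive.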