Closed mdsage1 closed 3 months ago
@tschaffter Quality Gate doesn't seem to be performing checks for this PR.
@tschaffter This is ready for review. Version is now an environment variable and the Description includes a preview.
@mdsage1 thanks for working on this!! You didn't ask me to, but I added some comments to the PR. Feel free to use them or ignore 😄
Issues: 0 new issues, 0 accepted issues
Measures: 0 security hotspots, 0.0% coverage on new code, 0.0% duplication on new code
Fixes #2546
This PR is part of #2546, so this PR should not be configured to close this ticket. I will remove it from the list.
closed in error
@mdsage1 ~Why do you show the std, mean and other metrics for the id column?~
Update: here is the information that the script should print:
- Version of EDAM processed
- Number of concepts that will be added to the table
- Total number of concepts
- Number of concepts for each of the following categories:
  - Data concepts
  - Operation concepts
  - Format concepts
  - Operation concepts
  - Other concepts
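A minimal sketch of how such a summary could be assembled once the per-category counts are known. The function name `format_report`, the version string, and the counts are all hypothetical placeholders, not values from this PR; the sketch covers only the version, the total, and the per-category lines.

```python
def format_report(version: str, counts: dict[str, int]) -> str:
    """Build a plain-text summary from a version string and per-category counts."""
    total = sum(counts.values())
    lines = [
        f"EDAM version: {version}",
        f"Total number of concepts: {total}",
        "Number of concepts per category:",
    ]
    # One indented line per category, in the order the dict provides.
    lines += [f"  {name}: {n}" for name, n in counts.items()]
    return "\n".join(lines)


# Placeholder inputs for illustration only.
example_counts = {"Data": 3, "Operation": 2, "Format": 1, "Other": 0}
print(format_report("x.y", example_counts))
```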
@tschaffter Where would I find this data in the EDAM CSV downloaded from GitHub? I do not see a column that indicates a concept name or category, only a preferred label that may include some of the concepts that you've listed.
Where would I find this data in the edam csv downloaded from GitHub.
You could get this information from the column `class_id` and a regex.
@tschaffter Please see an example of the class_id column below. Each entry in this column is a link that differs only in the sequence of numbers at the end.
@tschaffter I have updated the concept counts using the preferred_label column and removed the statistics like mean etc.
You could get this information from the column class_id and a regex.
See the suggestion I made above.
You could get this information from the column class_id and a regex.
@mdsage1 alternatively, you can also do a `replace()` to remove the substring you don't need from `class_id`, if you're not comfortable with regex.
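As a rough illustration of the `replace()` idea, here is a sketch in plain Python. It assumes every `class_id` shares one URL prefix followed by `<category>_<number>`; the prefix constant, the sample IDs, and the helper name `category_of` are assumptions for illustration and should be checked against the real CSV.

```python
from collections import Counter

# Assumed common prefix of every class_id value (verify against the CSV).
PREFIX = "http://edamontology.org/"

# Illustrative values only, not taken from the PR.
class_ids = [
    "http://edamontology.org/data_0005",
    "http://edamontology.org/data_0006",
    "http://edamontology.org/format_1915",
    "http://edamontology.org/operation_0004",
]


def category_of(class_id: str) -> str:
    """Drop the shared prefix, then keep the text before the first underscore."""
    local_name = class_id.replace(PREFIX, "")  # e.g. "data_0006"
    return local_name.split("_", 1)[0]         # e.g. "data"


counts = Counter(category_of(cid) for cid in class_ids)
print(counts)
```

Because the category is compared as a whole token after the prefix is removed, this avoids the partial-match problem that a plain substring test can run into.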
EDIT: Since you're interested in the number of concepts per category, you can actually use pandas' `contains` to get you closer to the count 🙂 e.g.
```python
>>> df["class_id"].str.contains("data")
0        True
1        True
2        True
3        True
4        True
        ...
3468    False
3469    False
3470    False
3471    False
3472    False
```
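To go from that boolean Series to an actual count, one option is to sum it, since `True` counts as 1. The sketch below uses a tiny hand-made DataFrame; the `class_id` values are assumed examples, not rows from the real file.

```python
import pandas as pd

# Illustrative rows only; real class_id values should be checked in the CSV.
df = pd.DataFrame(
    {
        "class_id": [
            "http://edamontology.org/data_0006",
            "http://edamontology.org/data_0842",
            "http://edamontology.org/format_1915",
            "http://edamontology.org/operation_0004",
        ]
    }
)

# str.contains returns a boolean Series; summing it counts the True values.
is_data = df["class_id"].str.contains("data")
n_data = int(is_data.sum())
print(n_data)
```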
Prefer exact match to using `contains` (more future proof): `contains` would not work if the ontology were to have the concept `Data` and `DataFormat`, for example.
if the ontology were to have the concept Data and DataFormat, for example.
Good point. Just shooting my shot here, but this can be overcome by using `data_` (assuming they use "dataformat_"). Also, you can use regex with `contains()`.
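To make the trade-off concrete, here is a small plain-Python sketch comparing a substring test, the `data_` variant, and an exact match on the category token. The `in` operator stands in for `contains` here. The `dataformat_` and `metadata_` IDs are hypothetical, invented only to show where each approach would over-count; `category_of` is an illustrative helper name.

```python
def category_of(class_id: str) -> str:
    """Return the text between the last '/' and the first '_' after it."""
    local_name = class_id.rsplit("/", 1)[-1]
    return local_name.split("_", 1)[0].lower()


# The last two IDs are hypothetical future concepts, not real EDAM entries.
class_ids = [
    "http://edamontology.org/data_0006",
    "http://edamontology.org/dataformat_0001",
    "http://edamontology.org/metadata_0001",
]

# Substring test: matches all three IDs.
substring_hits = [cid for cid in class_ids if "data" in cid]

# Adding the underscore excludes "dataformat_" but still matches "metadata_".
underscore_hits = [cid for cid in class_ids if "data_" in cid]

# Exact comparison on the extracted category matches only "data_0006".
exact_hits = [cid for cid in class_ids if category_of(cid) == "data"]

print(len(substring_hits), len(underscore_hits), len(exact_hits))
```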
@tschaffter I've updated the concept counts to use the class_id column and regex. The match ignores case to avoid any future issues. I didn't use contains; I used the search() function from the regex module instead. By adding the underscore to the regex, as @vpchung suggested, I have prevented data, or any other concept name, from being counted as a match when an additional word follows the word of interest.
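For readers following along, the approach described above might look roughly like the sketch below. This is not the code from the PR: the category tuple, the leading `/` in the pattern, the sample IDs, and the function name `count_categories` are all assumptions made for illustration.

```python
import re

# Categories counted by name; anything else falls into "other".
CATEGORIES = ("data", "operation", "format")

# "/" before and "_" after the name keep the match to the whole token.
# The leading "/" is an assumption about the URL layout of class_id.
PATTERNS = {
    name: re.compile(rf"/{name}_", re.IGNORECASE) for name in CATEGORIES
}


def count_categories(class_ids):
    """Count class_id values per category, ignoring case."""
    counts = {name: 0 for name in (*CATEGORIES, "other")}
    for cid in class_ids:
        for name, pattern in PATTERNS.items():
            if pattern.search(cid):
                counts[name] += 1
                break
        else:
            # No category pattern matched this class_id.
            counts["other"] += 1
    return counts


# Illustrative IDs; the mixed-case and "dataformat" entries are hypothetical.
sample = [
    "http://edamontology.org/data_0006",
    "http://edamontology.org/Data_2044",
    "http://edamontology.org/format_1915",
    "http://edamontology.org/operation_0004",
    "http://edamontology.org/dataformat_0001",
]
result = count_categories(sample)
print(result)
```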
Description
EDAM ETL processes need to be developed to incorporate the EDAM ontology into MariaDB, linking the ontology to existing data. This PR addresses the extract and transform portions.
Related Issue
Contribute to #2524 Contribute to #2548
Fixes #2547 Fixes #2563
Changelog
Preview