RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Start naming KG2 TSV tarball with version number (in S3)? #140

Open amykglen opened 2 years ago

amykglen commented 2 years ago

Instead of extracting KG2pre data via Neo4j, going forward the KG2c build process is going to ingest kg2-tsv-for-neo4j.tar.gz (downloaded from the rtx-kg2 S3 bucket).

Wondering if it would be reasonable to start naming that tarball in S3 with the KG2 version number? So, something like: kg2-7-2-tsv-for-neo4j.tar.gz

I realize that means we'd have to periodically delete old versions of the file so the S3 bucket doesn't get overly full, but it'd be really nice for the KG2c build process to be able to make sure it gets the right tarball (since currently the tarball is overwritten every time a new KG2pre build is done).
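
As a rough illustration of the naming scheme and the cleanup it implies, here is a minimal Python sketch. The helper names (`versioned_name`, `names_to_delete`) and the "keep the newest N" policy are assumptions made for the example, not part of the KG2 build system; the filename pattern follows the kg2-7-2-tsv-for-neo4j.tar.gz example above.

```python
import re

# Assumed pattern for the proposed scheme, e.g. kg2-7-2-tsv-for-neo4j.tar.gz
TARBALL_RE = re.compile(r"^kg2-(\d+)-(\d+)-tsv-for-neo4j\.tar\.gz$")


def versioned_name(major: int, minor: int) -> str:
    """Build a tarball name with the version embedded (hypothetical helper)."""
    return f"kg2-{major}-{minor}-tsv-for-neo4j.tar.gz"


def names_to_delete(names, keep=3):
    """Return the versioned tarball names that fall outside the newest `keep`.

    Names that do not match the versioned pattern (such as the current
    unversioned kg2-tsv-for-neo4j.tar.gz) are ignored and never returned.
    """
    versioned = []
    for name in names:
        match = TARBALL_RE.match(name)
        if match:
            version = (int(match.group(1)), int(match.group(2)))
            versioned.append((version, name))
    # Compare versions numerically so that 7-10 sorts after 7-2.
    versioned.sort(reverse=True)
    return [name for _, name in versioned[keep:]]
```

The list of names would come from listing the bucket contents; how that listing is obtained is left out of the sketch.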

saramsey commented 2 years ago

OK, I am thinking about how to do this while still preserving automation in the tsv-to-neo4j.sh script.

saramsey commented 2 years ago

I have created branch issue-140 for working on this issue.

saramsey commented 2 years ago

I have a mini build-system for the issue-140 branch working on my MBP, for development/test purposes for this issue.

saramsey commented 2 years ago

Lili and I discussed it, and we feel this issue may slip until after 2.7.4.

saramsey commented 1 year ago

Wondering if we can prioritize this for the next few weeks? @acevedol and @ecwood do you think it is doable?

saramsey commented 1 year ago

I'm specifically thinking that the output filenames that go to the S3 bucket should have the version number in the filename. I don't think the filenames on buildkg2.rtx.ai or whatever need to have the version number in the filename. Does that simplify things somewhat?

In hindsight, I don't think my decision to copy files like kg2-simplified.json to the S3 bucket without a version number in the filename was a very good choice. There is too much chance for confusion. It puts us in the position of having to check MD5 hashes or inspect the RTX:KG2 node in order to be sure which version a file is. We end up doing a surprising amount of that, and it seems like it could mostly be avoided if the S3 file artifacts had the version number embedded, or were stored in a version-named folder on S3 (to avoid clutter in the bucket).
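
To make the comparison concrete, here is a minimal Python sketch of the two approaches: hashing a downloaded file to work out which build it came from, versus building a key under a version-named folder. The function names and the folder layout (for example `KG2.7.2/kg2-simplified.json`) are assumptions for illustration only; the thread does not specify them.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def versioned_key(version, filename):
    """Build an object key under a version-named folder (assumed layout)."""
    return f"{version}/{filename}"
```

With a versioned key, the version can be read directly from the path, so the hash comparison is only needed as an integrity check rather than as a way to identify the build.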

ecwood commented 1 year ago

I can try to work on this in the next few weeks. I like the idea of a version-named folder on S3 to avoid clutter.

amykglen commented 1 month ago

Is there any way this could be implemented soon? Really all we would like is for the kg2-tsv-for-neo4j.tar.gz in S3 to be named with its version number somehow, either in the filename itself or by putting it in a subdirectory for that version. There's no need to change the file name within the KG2pre build itself (just upon upload to S3). It would be a big help for improving the robustness of KG2c builds.

ecwood commented 1 month ago

This should be done now. It will look something like kg2-tsv-for-neo4j-KG2.X.Y.tar.gz in the next build.
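
For the consuming side, a minimal Python sketch of how a KG2c-style build step might confirm it has the expected tarball from the name alone. It assumes the name follows the kg2-tsv-for-neo4j-KG2.X.Y.tar.gz shape literally, with numeric X and Y; the function names are hypothetical and not taken from the KG2c code.

```python
import re

# Assumes the "KG" prefix followed by three dot-separated numbers, e.g. KG2.8.4
NAME_RE = re.compile(r"^kg2-tsv-for-neo4j-KG(\d+)\.(\d+)\.(\d+)\.tar\.gz$")


def tarball_version(name):
    """Return the version as a tuple of ints, or None if the name is unversioned."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def check_tarball(name, expected):
    """Raise ValueError unless the tarball name carries the expected version."""
    found = tarball_version(name)
    if found != tuple(expected):
        raise ValueError(f"expected version {tuple(expected)}, found {found} in {name!r}")
```

A check like this would fail fast on the old unversioned name as well as on a mismatched version, instead of silently ingesting whatever tarball was uploaded last.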

amykglen commented 1 month ago

awesome, thank you!!