RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Edges.tsv too large for upload to Git LFS #211

Closed by acevedol 2 years ago

acevedol commented 2 years ago

One of the files generated for upload to the Knowledge Graph Exchange, edges.tsv, is much too large to upload to Git LFS. The file comes out at 34GB+, and the best compression I can get using xz -9 reduces it to about 13.9% of its original size. This produces a file that is about 4.7GB, still too large to push to Git LFS, which has a max file size of 4GB.

I propose splitting edges.tsv into two or more tsv files, then compressing these. The resulting compressed edges1.tsv.xz and edges2.tsv.xz should each be small enough to upload.
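As a rough check on how many parts are needed, the arithmetic can be sketched as below. It assumes the 13.9 figure is the compressed size as a percentage of the original (which matches 34GB going to about 4.7GB) and that each part compresses about as well as the whole file; the variable names are illustrative only.

```python
import math

ORIGINAL_GB = 34.0           # size of edges.tsv reported in this issue
COMPRESSED_FRACTION = 0.139  # assumption: "13.9" read as percent of original size
LFS_LIMIT_GB = 4.0           # max file size quoted in this issue

# Estimated size of the whole file after xz -9.
compressed_gb = ORIGINAL_GB * COMPRESSED_FRACTION

# Smallest number of equal parts that keeps each compressed part under the limit.
min_parts = math.ceil(compressed_gb / LFS_LIMIT_GB)
per_part_gb = compressed_gb / min_parts

print(f"{compressed_gb:.1f} GB compressed, {min_parts} parts, ~{per_part_gb:.1f} GB each")
```

Under these assumptions two parts are enough, with each compressed part well below the 4GB limit.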

@saramsey Do you see any potential issues?

saramsey commented 2 years ago

Hi Lili, this plan (splitting edges.tsv) sounds like a practical option. Thank you. I assume this applies just to KG2pre, right? i.e., I assume KG2c still fits?

acevedol commented 2 years ago

Yes, this is just for KG2pre, and just for the edges. I tried this out with a script, split_kgx_edges_tsv.py, and it produced files that compress small enough to upload. I'm not sure if this needs to be part of the kgx tsv build process, but I'm going to upload the script for future use.