NCATSTranslator / Knowledge_Graph_Exchange_Registry

The Biomedical Data Translator Consortium site for development of Knowledge Graph Exchange Standards and Registry
MIT License

If files are uploaded in small chunks, how does the system crosslink and integrate the data? #26

Closed RichardBruskiewich closed 3 years ago

RichardBruskiewich commented 3 years ago

If files are uploaded in small chunks, how does the system cross-link and connect the data?

A partial answer is that the core “Register File Set” web page is the control point: it initiates a file upload session and aggregates all files uploaded during that session into one “KGE File Set”. The upload page has a “Done Uploading File Set” button for closing the session. The application back end then manages the archive integration of the uploaded datasets (with metadata) into a single file set.
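To make the session idea concrete, here is a minimal in-memory sketch of how one upload session might collect files until it is closed. The class `FileSetSession`, its methods, and the file type labels are hypothetical names chosen for illustration; they are not the actual KGE Archive back end API.

```python
from dataclasses import dataclass, field

# Hypothetical file type labels, used only for this illustration.
FILE_TYPES = ("nodes", "edges", "metadata")


@dataclass
class FileSetSession:
    """Illustrative stand-in for one upload session (not the real back end)."""

    kg_name: str
    files: dict = field(default_factory=lambda: {t: [] for t in FILE_TYPES})
    closed: bool = False

    def add_file(self, file_name: str, file_type: str) -> None:
        """Record one uploaded file under the session's file set."""
        if self.closed:
            raise RuntimeError("file set is closed; start a new session")
        if file_type not in FILE_TYPES:
            raise ValueError(f"unknown file type: {file_type!r}")
        self.files[file_type].append(file_name)

    def done_uploading(self) -> dict:
        """Close the session and return a manifest of everything uploaded."""
        self.closed = True
        return {"kg_name": self.kg_name, **self.files}
```

Every file added before `done_uploading()` lands in the same manifest, so chunked uploads stay linked by the session rather than by file name.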

That said, a few questions remain:

  1. Since AWS S3 doesn't severely limit file sizes (at least, on the order of the anticipated knowledge graph dataset sizes), should incoming data be merged into standard file components, perhaps renamed, for greater back end uniformity of access?

  2. Could the "Data File" upload form have radio buttons to select the type of file being uploaded, e.g. "node file", "edge file", etc., as a hint to the system about integration? - DONE

  3. How should such large data be published to clients wishing to download it: in its original form (with the original file names) or in its "consolidated" form (normalized into consolidated nodes.tsv and edges.tsv files in a tar.gz archive)?

  4. Should (or could) downloading be a "streaming" transaction or a discrete file download? This is partly solved by a workable download from the web. Q: How might this change for a CLI or program library solution?
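As a rough illustration of the "consolidated" option in question 3, the sketch below concatenates several uploaded TSV chunks into one file and writes the header only once. The function name `merge_tsv_chunks` is hypothetical, and the sketch assumes every chunk of a given type begins with the same header line, which may not hold for real uploads.

```python
from pathlib import Path
from typing import Iterable


def merge_tsv_chunks(chunk_paths: Iterable[Path], out_path: Path) -> int:
    """Concatenate TSV chunks into out_path, writing the header once.

    Returns the number of data rows written.
    Assumption: all non-empty chunks share an identical header line.
    """
    header = None
    rows = 0
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in chunk_paths:
            with open(path, "r", encoding="utf-8", newline="") as chunk:
                first = chunk.readline().rstrip("\r\n")
                if not first:
                    continue  # skip an empty chunk
                if header is None:
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError(f"header mismatch in {path}")
                # Copy the remaining lines, skipping blank ones.
                for line in chunk:
                    line = line.rstrip("\r\n")
                    if line:
                        out.write(line + "\n")
                        rows += 1
    return rows
```

Because the chunks are read line by line, memory use stays flat regardless of how large the individual uploads are.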

RichardBruskiewich commented 3 years ago

Overall, KGE File Set compilation is now largely managed by the KgeaArchiveCatalog class in the kgea.session.catalog.Catalog module. Further iterations will refine it.

RichardBruskiewich commented 3 years ago

@RichardBruskiewich will confer with a few Translator collaborators regarding their practical preferences for download archive structure and behaviour.

RichardBruskiewich commented 3 years ago

We have converged on building and using a single normalized tar.gz archive for downloading.
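For readers wondering what this could look like from a CLI or library client, here is a minimal sketch using Python's standard `tarfile` module. It packs consolidated `nodes.tsv` and `edges.tsv` files into one tar.gz archive, then reads the archive back in streaming mode (`r|gz`), which works on non-seekable sources such as an HTTP response body. The function names and the fixed two-file layout are assumptions for illustration, not the actual archive layout used by the project.

```python
import tarfile
from pathlib import Path
from typing import BinaryIO, List

# Assumed archive contents, for illustration only.
MEMBERS = ("nodes.tsv", "edges.tsv")


def build_archive(file_set_dir: Path, archive_path: Path) -> List[str]:
    """Pack the consolidated files from file_set_dir into one tar.gz archive."""
    added = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for name in MEMBERS:
            # arcname keeps the archive free of local directory prefixes.
            tar.add(str(file_set_dir / name), arcname=name)
            added.append(name)
    return added


def stream_member_names(fileobj: BinaryIO) -> List[str]:
    """List archive members while reading the stream strictly front to back."""
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        return [member.name for member in tar]
```

In stream mode, members have to be consumed in order as they arrive, which suits a download pipeline but rules out random access to individual files.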