clemente-lab / mmeds-meta

A database for storing and analyzing omics data
https://mmeds.org

Make full use of MongoDB for file storage #470

Open adamcantor22 opened 1 week ago

adamcantor22 commented 1 week ago

Is your feature request related to a problem? Please describe.

Our MongoDB documents currently serve two purposes: they record the MMEDS actions that have been performed, and they point to the relevant files in the minerva allocation. Ideally, we would store the files themselves in the database wherever possible. Feature tables produced by analyses are the priority, since having them in the database would make them much easier to reuse in further analyses.

Additional context

Storing metadata and feature tables should definitely be possible. MongoDB has a 16MB size limit for individual documents, but a chunking specification called GridFS can be used to store larger files such as FASTQs (https://www.mongodb.com/docs/manual/core/gridfs/?_ga=2.203260177.1800629566.1719414300-409848511.1718725591). It is unclear whether GridFS has a size limit of its own, and how fast it is. It is also unclear how much space is available to us on our MongoDB: does it expand automatically if we reach some limit? If so, storing FASTQs there instead of on our allocation would free up a lot of space on the allocation. However, since we sometimes have to "dump and load" the database when we make significant changes to it, we should probably wait until #428 is resolved before removing the raw data from the allocation, and even then we should first back it up onto a NAS or hard disk.
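
As a rough illustration of the split described above, here is a minimal Python sketch, not the MMEDS implementation. It assumes we talk to MongoDB through PyMongo and its bundled `gridfs` package. The helper names (`needs_gridfs`, `gridfs_chunk_count`, `store_file`), the 1MB headroom, and the `study` metadata key are all hypothetical. The two constants reflect the documented 16MB BSON document limit and the 255 kB default GridFS chunk size.

```python
import math
from pathlib import Path

MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024  # MongoDB's per-document size limit
GRIDFS_DEFAULT_CHUNK_BYTES = 255 * 1024     # default GridFS chunk size


def needs_gridfs(size_bytes, headroom_bytes=1024 * 1024):
    """Return True if a file is too large to embed in a single document.

    headroom_bytes is an assumed safety margin for the other fields
    stored alongside the file contents in the same document.
    """
    return size_bytes > MAX_BSON_DOCUMENT_BYTES - headroom_bytes


def gridfs_chunk_count(size_bytes, chunk_bytes=GRIDFS_DEFAULT_CHUNK_BYTES):
    """Number of chunk documents GridFS would create for a file of this size."""
    return math.ceil(size_bytes / chunk_bytes)


def store_file(db, path, study_name):
    """Upload a file into GridFS and return its file id.

    db is expected to be a pymongo Database. The metadata layout is an
    assumption made for this sketch.
    """
    # Imported here so the sizing helpers above work without PyMongo installed.
    import gridfs

    bucket = gridfs.GridFSBucket(db)
    path = Path(path)
    with path.open("rb") as source:
        return bucket.upload_from_stream(
            path.name, source, metadata={"study": study_name}
        )
```

With a running server, usage might look like `store_file(MongoClient()["mmeds"], "feature_table.biom", "my_study")`, where the database name, file name, and study name are placeholders. The returned id could then be saved in the existing document in place of the current file path. This does not answer the open questions about speed or available space.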