Sefaria / Sefaria-Export

Structured Jewish texts and metadata exported from Sefaria's database.

Duplicates leading to large repository sizes #27

Open pseudomonas opened 2 years ago

pseudomonas commented 2 years ago

The cltk-flat and cltk-full directories seem to duplicate a lot of the content from the json directory. Each of these directories is 4.1GB, so a git clone operation is extremely slow and requires a lot of disk space. (A sparse clone is theoretically possible, but it is very fiddly to set up and very slow to execute, and it has problems with the number of files in the schema directory.)

Would it be possible to do one of the following?

  1. Put the cltk* material in a separate git repository
  2. Have a helper script that re-builds the cltk* based on the information in the json directory if needed
  3. Have a helper script that downloads the cltk* from an FTP site if needed
  4. Have the cltk* file trees use symlinks to, rather than duplicating the files from, json
  5. Refactor the code so that it does not need largely redundant file trees
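
To make option 4 concrete, here is a rough sketch of what a symlinking helper might look like. It is only an illustration: the function names (`file_digest`, `find_duplicates`, `link_duplicates`) are made up for this example, and it assumes that the duplicated files are byte-identical to files under json, which I have not verified. If the cltk* files are reformatted versions of the json content, a hash comparison like this will not match them, and option 2 (rebuilding from json) would be the better fit.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(source_root: Path, mirror_root: Path) -> list[tuple[Path, Path]]:
    """Pair each regular file under mirror_root with a byte-identical
    file under source_root. Existing symlinks are skipped."""
    by_digest: dict[str, Path] = {}
    for p in sorted(source_root.rglob("*")):
        if p.is_file() and not p.is_symlink():
            # Keep the first path seen for each distinct content hash.
            by_digest.setdefault(file_digest(p), p)

    pairs = []
    for p in sorted(mirror_root.rglob("*")):
        if p.is_file() and not p.is_symlink():
            match = by_digest.get(file_digest(p))
            if match is not None:
                pairs.append((p, match))
    return pairs


def link_duplicates(pairs: list[tuple[Path, Path]]) -> int:
    """Replace each duplicate with a relative symlink to its original.
    Returns the number of bytes that the replaced files occupied."""
    saved = 0
    for dup, original in pairs:
        saved += dup.stat().st_size
        # A relative target keeps the link valid wherever the repo is cloned.
        target = os.path.relpath(original, start=dup.parent)
        dup.unlink()
        dup.symlink_to(target)
    return saved


if __name__ == "__main__":
    # Hypothetical invocation from the repository root.
    pairs = find_duplicates(Path("json"), Path("cltk-flat"))
    print(f"{len(pairs)} byte-identical files found")
```

One caveat: Git already stores byte-identical file contents as a single blob, so symlinking exact duplicates would mainly shrink the checked-out working tree rather than the amount of data transferred by a clone. Running only `find_duplicates` first would show how much of the overlap is exact. Creating symlinks may also require extra privileges on Windows.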