DCMLab / corpusinterface

Basic functionality to maintain and load corpora.

Single-file corpora #5

Open pettter opened 4 years ago

pettter commented 4 years ago

Support for corpora that are not collections of files but segments of a single file.

pettter commented 4 years ago

We further split this into several different types of files, e.g.

- JSONCorpus (like the iRealPro corpus)
- XMLCorpus (XPath navigation?)
- SQLiteCorpus?
- Others?

pettter commented 4 years ago

JSONCorpus has basic functionality as of 99e9f5a.
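For illustration, here is a minimal sketch of what treating one JSON file as a corpus could look like. It is not the code from 99e9f5a and not the corpusinterface API. The function name `iter_json_pieces` and the example data are hypothetical, and it assumes the top-level JSON value is either a list of pieces or an object mapping piece ids to pieces.

```python
import json
from typing import Any, Iterator, Tuple


def iter_json_pieces(text: str) -> Iterator[Tuple[str, Any]]:
    """Yield (piece_id, piece) pairs from a single-file JSON corpus.

    Assumption: the top-level value is a list of pieces or a dict
    mapping piece ids to pieces. Real corpora may nest differently.
    """
    data = json.loads(text)
    if isinstance(data, dict):
        # Object at the top level: use the keys as piece ids.
        for key, piece in data.items():
            yield str(key), piece
    elif isinstance(data, list):
        # List at the top level: use the position as piece id.
        for index, piece in enumerate(data):
            yield str(index), piece
    else:
        raise ValueError("expected a JSON list or object at the top level")


# Made-up example data, not taken from any real corpus.
example = (
    '[{"title": "Tune A", "chords": ["C", "G7"]},'
    ' {"title": "Tune B", "chords": ["Dm"]}]'
)
pieces = list(iter_json_pieces(example))
```

Each segment of the file then behaves like one document of a multi-file corpus, identified by its id instead of a file name.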

fabianmoss commented 4 years ago

several text files, humdrum, abc (not the ABC corpus), etc.

fabianmoss commented 4 years ago

lilypond, e.g. the Crueger cantional settings in this ZIP file: https://miami.uni-muenster.de/Record/c8e13273-c323-4c20-93f3-e3e6caff3224

pettter commented 4 years ago

I'm not sure how common it is to have the entire corpus as a single file in those formats - I've mostly seen them used to contain a single "piece", with corpora being collections of many such files.

fabianmoss commented 4 years ago

True. We should have a list of potential formats somewhere, though.

chfin commented 4 years ago

It can happen, though. Sometimes you have a summary file of a corpus that contains some representation of all pieces. Think of the CSV file in the Choro corpus, or the Jazz trees, which are all in a single JSON file. The question is whether they have something useful in common.

pettter commented 4 years ago

The Jazz trees (and similar JSON corpora) are supported in a very basic way, and implementing a similar thing for CSV should be relatively straightforward.

I'll have a look at getting the Choro corpus in.
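As a rough sketch of the CSV case, assuming one row per piece and a column that identifies the piece: the function name `iter_csv_pieces`, the `id_column` parameter, and the sample rows below are all hypothetical and do not reflect the actual layout of the Choro corpus file or the corpusinterface API.

```python
import csv
import io
from typing import Dict, Iterator, Tuple


def iter_csv_pieces(text: str, id_column: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (piece_id, row) pairs from a single-file CSV corpus.

    Assumption: the file has a header row and each following row
    describes exactly one piece.
    """
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise KeyError(f"column {id_column!r} not found in CSV header")
    for row in reader:
        yield row[id_column], dict(row)


# Made-up sample rows for illustration only.
sample = "id,title,key\np1,First Piece,C\np2,Second Piece,Gm\n"
rows = dict(iter_csv_pieces(sample, id_column="id"))
```

Yielding `(piece_id, record)` pairs is one candidate for the "something useful in common" between JSON and CSV single-file corpora.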

pettter commented 4 years ago

@fabianmoss The Choro corpus seems to be private at the moment?

fabianmoss commented 4 years ago

Yes. Because the paper is STILL in review. I can give you a copy of the file tomorrow if you remind me again 😉

pettter wrote on Tue, 25 Feb 2020, 14:45:

> @fabianmoss The Choro corpus seems to be private at the moment?


pettter commented 4 years ago

Ah no, that's fine, I can get the file. It's just a question of whether we could add it to the corpora.csv file as something downloadable and loadable.

...But I guess I should remove, or at least obfuscate a little, the ten-line excerpt I added to the git-test-corpus?

chfin commented 3 years ago

The previous implementations of single-file corpora were removed at some point. They are now being developed in #28.