clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

readme file in corpus distribution #654

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

Currently, corpus distribution files contain basic information about the corpus generated from the template for all formats:

README.TEI.ana.txt  README.TEI.txt  README.conll.txt  README.schema.txt  README.txt.txt  README.vert.txt

It can be enriched with documentation in Data/ParlaMint-XX/README.md

Note that the current distro readme is in txt format, while the corpus README.md files are in Markdown.

TomazErjavec commented 1 year ago

So, to have the complete description of the corpus included with every format? I'm not sure anymore if that is really necessary. But I do agree that the readmes should be generated, and probably in MD. Maybe some basic stats too? But that is another topic.

matyaskopp commented 1 year ago

So, to have the complete description of the corpus included with every format? I'm not sure anymore if that is really necessary.

I agree that it is not necessary. I did not realize that all formats are packed into two archives:

ParlaMint-XX.ana

ParlaMint-XX.ana
├── ParlaMint-XX.conllu
├── ParlaMint-XX.TEI.ana
│   └── Schema
└── ParlaMint-XX.vert

ParlaMint-XX

ParlaMint-XX
├── ParlaMint-XX.TEI
│   └── Schema
└── ParlaMint-XX.txt

So, we can introduce the "main" README, which can be placed in a parent directory:

ParlaMint-XX.ana
├── ParlaMint-XX.conllu
├── ParlaMint-XX.TEI.ana
│   └── Schema
├── ParlaMint-XX.vert
└── README.md

So the complete description will be included once in each of the two archives, ParlaMint-XX.tgz and ParlaMint-XX.ana.tgz, rather than with every format.

TomazErjavec commented 1 year ago

Currently, once you unpack ParlaMint-XX.tgz or ParlaMint-XX.ana.tgz you do not get the result in a single directory like ParlaMint-LV/ or ParlaMint-LV.ana/ but in several directories, one for each format, as you illustrated above. So we cannot have a top-level README.md for a corpus, as it would not be clear which corpus it belongs to. Possible solutions:

  1. Introduce a top level directory, but it would contain very few things: the README, and the 2-3 format directories
  2. Use README-XX.md or README-XX.ana.md instead of the default README.md. However, this is not the default README filename, so, e.g., it might not display automatically on GitHub.

Which one do you think is the preferable option? I don't really have a preference, each has advantages and, more to the point, disadvantages.
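
To see what an archive really unpacks into, independent of any GUI tool, the top-level entries can be listed directly from the tar file. The following is a minimal sketch using Python's standard `tarfile` module; the function name `top_level_entries` is made up for illustration and is not part of the ParlaMint scripts.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(archive_path):
    """Return the sorted first path components of all members of a tar archive."""
    roots = set()
    # "r:*" lets tarfile detect the compression (gzip for .tgz) automatically.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            # PurePosixPath drops a leading "./"; also ignore a leading "/".
            parts = [p for p in PurePosixPath(member.name).parts if p != "/"]
            if parts:
                roots.add(parts[0])
    return sorted(roots)
```

If the result contains more than one name, e.g. one directory per format, the archive has no single top-level directory and a plain README.md at that level would be ambiguous.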

matyaskopp commented 1 year ago

Currently, once you unpack ParlaMint-XX.tgz or ParlaMint-XX.ana.tgz you do not get the result in a single directory like ParlaMint-LV/ or ParlaMint-LV.ana/ but in several directories, one for each format, as you illustrated above. So we cannot have a top-level README.md for a corpus, as it would not be clear which corpus it belongs to.

Oh, sorry, you are right. I used a GUI tool for unpacking, and it introduced a directory that is not in the archive.

  1. Introduce a top level directory, but it would contain very few things: the README, and the 2-3 format directories
  2. Use README-XX.md or README-XX.ana.md instead of the default README.md. However, this is not the default README filename, so, e.g., it might not display automatically on GitHub.

Which one do you think is the preferable option? I don't really have a preference, each has advantages and, more to the point, disadvantages.

I believe that option 2 is slightly better, since we want to be able to cross-reference between the various releases:

So every reference should work when you unpack all these archives into one directory. So readmes:

will work.

And we can also introduce the main readmes for the whole release in future:

with project description, release link and the list of included parliaments.

TomazErjavec commented 1 year ago

I believe that option 2 is slightly better

I just remembered that we actually require the top level directory: "A complete directory should be compressed, and the name of the compressed file should be the same as the directory it unpacks in."
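
The quoted rule could be checked mechanically along the following lines. This is only a sketch under the assumption that archives are named `<directory>.tgz` or `<directory>.tar.gz`; the function name `unpacks_into_own_directory` is hypothetical and not an existing ParlaMint or CLARIN tool.

```python
import tarfile
from pathlib import Path, PurePosixPath


def unpacks_into_own_directory(archive_path):
    """True if every member sits under one top-level directory named like the archive."""
    expected = Path(archive_path).name
    # Strip the (assumed) archive suffix to get the expected directory name.
    for suffix in (".tar.gz", ".tgz"):
        if expected.endswith(suffix):
            expected = expected[: -len(suffix)]
            break
    roots = set()
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            parts = [p for p in PurePosixPath(member.name).parts if p != "/"]
            if parts:
                roots.add(parts[0])
    return roots == {expected}
```

For an archive such as ParlaMint-XX.ana.tgz that contains only per-format directories, this returns `False`; it would return `True` only if everything were wrapped in a single ParlaMint-XX.ana/ directory.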

Then again, if anything, ParlaMint is entitled to bend the rules a bit... The filenames are OK, except that README-XX.mt.md should be README-XX-en.md, and the top-level one (if we need it) should probably be README.en.md

Is it enough to add the version and handle?

TomazErjavec commented 1 year ago

This has now been implemented, closing.