matyaskopp closed 1 year ago
So, to have the complete description of the corpus included with every format? I'm not sure anymore if that is really necessary. But I do agree that the readmes should be generated, and probably in MD. Maybe some basic stats too? But that is another topic.
> So, to have the complete description of the corpus included with every format? I'm not sure anymore if that is really necessary.
Agree that it is not necessary. I did not realize that all formats are packed into two archives:
ParlaMint-XX.ana
├── ParlaMint-XX.conllu
├── ParlaMint-XX.TEI.ana
│ └── Schema
└── ParlaMint-XX.vert
ParlaMint-XX
├── ParlaMint-XX.TEI
│ └── Schema
└── ParlaMint-XX.txt
So, we can introduce the "main" README, which can be placed in a parent directory:
ParlaMint-XX.ana
├── ParlaMint-XX.conllu
├── ParlaMint-XX.TEI.ana
│ └── Schema
├── ParlaMint-XX.vert
└── README.md
So the complete description will come with both archives, ParlaMint-XX.tgz and ParlaMint-XX.ana.tgz (not with every format).
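Currently, once you unpack ParlaMint-XX.tgz or ParlaMint-XX.ana.tgz you do not get the result in a single directory like ParlaMint-LV/ or ParlaMint-LV.ana/ but in several directories, one for each format, as you illustrated above. So we cannot have a top-level README.md for a corpus, as it would not be clear which corpus it belongs to. Possible solutions:
- Introduce a top-level directory, but it would contain very few things: the README and the 2-3 format directories.
- Use README-XX.md or README-XX.ana.md instead of the default README.md; since this is not the default filename for a README, it might not, e.g., display automatically on GitHub.

Which one do you think is the preferable option? I don't really have a preference; each has advantages and, more to the point, disadvantages.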
> Currently, once you unpack ParlaMint-XX.tgz or ParlaMint-XX.ana.tgz you do not get the result in a single directory like ParlaMint-LV/ or ParlaMint-LV.ana/ but in several directories, one for each format, as you illustrated above. So we cannot have a top-level README.md for a corpus, as it would not be clear which corpus it belongs to.
Oh, sorry, you are right. I used a GUI tool for unpacking and it introduced a directory which is not in the archive.
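For reference, the layout of an archive can be checked without unpacking it, so that no tool gets a chance to add a wrapper directory. This is only a minimal sketch using Python's standard `tarfile` module; the function names `top_level_entries` and `unpacks_into_single_directory` are made up for illustration and are not part of any ParlaMint script.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(tar_path):
    """Return the sorted first path components of all members of a tar archive."""
    tops = set()
    # "r:*" lets tarfile detect the compression (e.g. gzip for .tgz) itself.
    with tarfile.open(tar_path, mode="r:*") as tar:
        for member in tar.getmembers():
            # PurePosixPath drops a leading "./", so "./a/b" and "a/b" agree.
            parts = [p for p in PurePosixPath(member.name).parts if p != "."]
            if parts:
                tops.add(parts[0])
    return sorted(tops)


def unpacks_into_single_directory(tar_path):
    """True if everything in the archive lives under one top-level entry."""
    return len(top_level_entries(tar_path)) == 1
```

Calling `top_level_entries("ParlaMint-XX.tgz")` on an archive laid out as described above would list the per-format directories rather than a single `ParlaMint-XX` entry.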
> - Introduce a top-level directory, but it would contain very few things: the README and the 2-3 format directories.
> - Use README-XX.md or README-XX.ana.md instead of the default README.md; since this is not the default filename for a README, it might not, e.g., display automatically on GitHub.
>
> Which one do you think is the preferable option? I don't really have a preference; each has advantages and, more to the point, disadvantages.
I believe that option 2 is slightly better, since we want to cross-reference the various releases:
So every reference should work when you unpack all these archives into one directory. So readmes:
README-XX.md
README-XX.ana.md
README-XX.mt.md
README-XX.audio.md
will work.
And we can also introduce the main readmes for the whole release in the future:
README.md
README.ana.md
README.mt.md
README.audio.md
with project description, release link and the list of included parliaments.
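The naming scheme listed above is regular enough to be produced by one small helper. The following is only a sketch of that pattern; the function `readme_name` and its parameters are hypothetical, and it assumes the variant suffixes stay exactly as listed (`ana`, `mt`, `audio`).

```python
def readme_name(country=None, variant=None):
    """Build a README filename following the pattern discussed above.

    country: corpus code such as "XX", or None for a release-level README.
    variant: "ana", "mt", "audio", or None for the plain corpus.
    """
    name = "README"
    if country:
        name += f"-{country}"
    if variant:
        name += f".{variant}"
    return name + ".md"
```

For example, `readme_name("XX", "ana")` gives `README-XX.ana.md` and `readme_name(variant="mt")` gives `README.mt.md`, so per-corpus and release-level readmes cannot collide when all archives are unpacked into one directory.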
> I believe that option 2 is slightly better
I just remembered that we actually require the top level directory: "A complete directory should be compressed, and the name of the compressed file should be the same as the directory it unpacks in."
Then again, if anything, ParlaMint is entitled to bend the rules a bit... The filenames are OK, except that README-XX.mt.md should be README-XX-en.md, and the top-level one (if we need it) should probably be README.en.md.
Is it enough to add the version and handle?
This has now been implemented, closing.
Currently, corpus distribution files contain basic information about the corpus generated from the template for all formats:
It can be enriched with documentation in Data/ParlaMint-XX/README.md
Note that the current distro readme is in txt format, but the corpus README.md files are in Markdown.
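One way the generated part and the hand-written corpus documentation could be combined is sketched below. This is an assumption-laden illustration, not the project's actual generator: the header template, the function `build_readme`, and its parameters are all hypothetical, and the real template text is not reproduced here. It only shows the idea of a Markdown header carrying the version and handle, followed by the content of Data/ParlaMint-XX/README.md when there is any.

```python
from string import Template

# Hypothetical header; the project's real template is not shown in this thread.
HEADER = Template(
    "# ParlaMint-$country$suffix\n"
    "\n"
    "- Version: $version\n"
    "- Handle: $handle\n"
)


def build_readme(country, version, handle, corpus_doc="", variant=None):
    """Return Markdown text: generated header plus optional corpus documentation."""
    suffix = f".{variant}" if variant else ""
    header = HEADER.substitute(
        country=country, suffix=suffix, version=version, handle=handle
    )
    body = corpus_doc.strip()
    # Append the corpus-specific documentation only if it exists.
    return header + (f"\n{body}\n" if body else "")
```

The caller would read Data/ParlaMint-XX/README.md (if present), pass its text as `corpus_doc`, and write the result under whatever filename is agreed on.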