COST-ELTeC / ELTeC-por

Portuguese collection in ELTeC (European Literary Text Collection)
https://distant-reading.net

database #3

Open philbmz opened 1 year ago

philbmz commented 1 year ago

Hi, I would like to know how to use this as a database in Python. I'm trying to get some information from the XML files by their tags, like "author", "date" (the release date), "title" and some others. But the release date is not standardized across the XML files: when I take the text of the second "date" tag (for example) from each file, some of the dates are the correct release date, but others aren't, because the correct one is in another "date" tag, the first or third one (according to the metadata CSV). So, can I get this information in another way?

lb42 commented 1 year ago

Not sure if I understand your question correctly, but if by "release date" you mean publication date, you will find that this is always given in the second column of the metadata.csv file. It will appear in different places in the TEI Header (the XML version of the file) depending on the kind of bibliographic data provided. The date of the first edition, if available, should be located by an XPath like "sourceDesc//bibl[@type='firstEdition']/date" . Which files are you looking at?
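As a rough illustration of the XPath above in Python, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. It assumes the files declare the usual TEI namespace (`http://www.tei-c.org/ns/1.0`). The function name `first_edition_date` and the shortened sample header are made up for illustration and are not taken from the collection; in particular, the `type` value on the other `bibl` is only a placeholder.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard TEI namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Namespaced form of sourceDesc//bibl[@type='firstEdition']/date
FIRST_EDITION = ".//tei:sourceDesc//tei:bibl[@type='firstEdition']/tei:date"


def first_edition_date(xml_text):
    """Return the first-edition date text, or None if it is not recorded."""
    root = ET.fromstring(xml_text)
    node = root.find(FIRST_EDITION, NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


# Hypothetical, heavily shortened header, only to show the call.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>
<sourceDesc>
  <bibl type="placeholderSource"><date>2015</date></bibl>
  <bibl type="firstEdition"><date>1878</date></bibl>
</sourceDesc>
</fileDesc></teiHeader></TEI>"""

print(first_edition_date(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring(xml_text)`. Since the first edition date is not always known, the `None` case should be expected.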

philbmz commented 1 year ago

Yes, the publication date, and I'm looking at the level 1 XML files.

[image: Opera Instantâneo_2023-03-22_171837_zenodo org]

dianamsmpsantos commented 1 year ago

Hi, I wonder whether you want to get the information from the xml files, or whether it is enough to use the metadata file. In case you want to get the date information from the xml files, you have to understand that there are potentially three dates: the first edition date (not always known), the date of the physical copy that was digitized, and/or the date of the digitization that was used for ELTeC. Which date do you want? And which cases do you mean "some of the date are the correct release date, but some others arent"? If you tell us which ones gave you problems, I might either correct it or explain why it is like that.
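To see which date is which without relying on position, one option is to group every date by the `type` attribute of the `bibl` that contains it. The sketch below does this with the standard library, assuming the TEI namespace `http://www.tei-c.org/ns/1.0` and that each date is a direct child of its `bibl` inside `sourceDesc`. The helper name `dates_by_bibl_type`, the sample headers, and the type name `placeholderSource` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard TEI namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def dates_by_bibl_type(xml_text):
    """Map each bibl @type inside sourceDesc to the dates it directly contains."""
    root = ET.fromstring(xml_text)
    found = {}
    for bibl in root.iterfind(".//tei:sourceDesc//tei:bibl", NS):
        kind = bibl.get("type", "(no type)")
        # Direct children only, so dates of nested bibl elements are not mixed in.
        for date in bibl.findall("tei:date", NS):
            found.setdefault(kind, []).append((date.text or "").strip())
    return found


# Two hypothetical headers with the same content in a different order.
HEADER = ('<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
          '<sourceDesc>{}</sourceDesc></fileDesc></teiHeader></TEI>')
FIRST = '<bibl type="firstEdition"><date>1878</date></bibl>'
OTHER = '<bibl type="placeholderSource"><date>1902</date></bibl>'
file_a = HEADER.format(FIRST + OTHER)
file_b = HEADER.format(OTHER + FIRST)

# Looking up by type gives the same answer regardless of order.
print(dates_by_bibl_type(file_a).get("firstEdition"))
print(dates_by_bibl_type(file_b).get("firstEdition"))
```

Printing the whole dictionary for a problematic file is also a quick way to report which dates it actually contains.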

Anyway, from your mail you seem to use the second date... but there is no actual requirement that the second is consistently the same. What is encoded is whether it is inside

And the order of these may vary. Hope this helped, Diana

philbmz commented 1 year ago

Yeah, I'm trying to read all those XML files with Python and do some NLP studies on them. For this I need the date of the first edition, but I'm getting the information by tag name, which won't work because the first edition date is not in the same "date" tag position in every XML file. I'm not sure if I'm making myself clear, but by "second date" I mean the second occurrence of the tags named "date": sometimes this second occurrence gives me the first edition date, and sometimes another date. Anyway, I get that the method I'm using is the problem. Thanks for the help.

[image: Sem título (untitled screenshot)]

Just to illustrate, these are the first 3 occurrences of the "date" tag. Sometimes the first edition date is the third one, but in other files it isn't. I thought these dates were standardized, but now I get that I have to extract this information with another method. So again, thanks for the help.