clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Annotating words with USAS #204

Closed TomazErjavec closed 10 months ago

TomazErjavec commented 2 years ago

The USAS semantic tags will be encoded in a taxonomy (cf. #202), but there remains the question of how to encode these tags (or, rather, references to the IDs of the taxonomy categories) on word tokens. An important complication is that USAS can also tag multi-word expressions (MWEs).

One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well formed XML or difficult fixes.

An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we have used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).

In line with this, the encoding (suitably simplified) could be like:

<s xml:id="s1">
 <w xml:id="t1">I</w>
 <w xml:id="t2">therefore</w>
 <w xml:id="t3">very</w>
 <w xml:id="t4">much</w>
 <w xml:id="t5">welcome</w>
 <w xml:id="t6">the</w>
 <w xml:id="t7">Government's</w>
 <w xml:id="t8">intention</w>
 <linkGrp type="USAS-SEM">
   <ptr ana="usas:Z8" target="#t1"/>
   <ptr ana="usas:Z5" target="#t2"/>
   <ptr ana="usas:A13.3" target="#t3 #t4"/>
   <ptr ana="usas:Q2.2" target="#t5"/>
   <ptr ana="usas:Z5" target="#t6"/>
   <ptr ana="usas:G1.1" target="#t7"/>
   <ptr ana="usas:X7p" target="#t8"/>
 </linkGrp>
</s>

@matyaskopp, do you see any problems with this suggestion?
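
To illustrate how a consumer might read this encoding, here is a minimal Python sketch that resolves each ptr back to its token(s). It works on the simplified sample above, which has no TEI namespace; real ParlaMint files would need the TEI namespace in the element paths. The function name `usas_annotations` is hypothetical and not part of any ParlaMint tooling.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<s xml:id="s1">
 <w xml:id="t1">I</w>
 <w xml:id="t2">therefore</w>
 <w xml:id="t3">very</w>
 <w xml:id="t4">much</w>
 <w xml:id="t5">welcome</w>
 <w xml:id="t6">the</w>
 <w xml:id="t7">Government's</w>
 <w xml:id="t8">intention</w>
 <linkGrp type="USAS-SEM">
   <ptr ana="usas:Z8" target="#t1"/>
   <ptr ana="usas:Z5" target="#t2"/>
   <ptr ana="usas:A13.3" target="#t3 #t4"/>
   <ptr ana="usas:Q2.2" target="#t5"/>
   <ptr ana="usas:Z5" target="#t6"/>
   <ptr ana="usas:G1.1" target="#t7"/>
   <ptr ana="usas:X7p" target="#t8"/>
 </linkGrp>
</s>"""


def usas_annotations(sentence_xml):
    """Return (usas_tag, [token texts]) pairs for one simplified <s> element."""
    s = ET.fromstring(sentence_xml)
    # Map each token ID to its surface form.
    tokens = {w.get(XML_ID): w.text for w in s.iter("w")}
    pairs = []
    for ptr in s.iterfind("linkGrp[@type='USAS-SEM']/ptr"):
        # @target holds one or more space-separated '#id' references.
        ids = [ref.lstrip("#") for ref in ptr.get("target").split()]
        pairs.append((ptr.get("ana"), [tokens[i] for i in ids]))
    return pairs


if __name__ == "__main__":
    for tag, words in usas_annotations(SAMPLE):
        print(tag, " ".join(words))
```

Because a single-word tag and an MWE differ only in the number of IDREFs in @target, the same loop handles both cases.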

JohnVidler commented 12 months ago

Hello folks - I've thrown some 'big compute' at the software that Paul handed over, and I'm happy to report that, barring some final checks and handling any errors (of which there seem to be very few!), I'm now just awaiting the remaining datasets for processing.

Once @perayson has passed his expert eye over the output we'll report back on the status of everything.

JohnVidler commented 12 months ago

@TomazErjavec I'm happy to say we now have the following state over all language sets:

Set Ready Processed Packaged
AT-en No
BA-en Yes Yes Yes
BE-en Yes Yes Yes
BG-en Yes Yes Yes
CZ-en No
DK-en Yes Yes Yes
EE-en Yes Yes Yes
ES-CT-en Yes Yes Yes
ES-GA-en Yes Yes Yes
FR-en Yes Yes Yes
GR-en Yes Yes Yes
HR-en Yes Yes Yes
HU-en No
IS-en Yes Yes Yes
IT-en Yes Yes Yes
LV-en Yes Yes Yes
NL-en Yes Yes Yes
NO-en Yes Yes Yes
PL-en Yes Yes Yes
PT-en Yes Yes Yes
RS-en Yes Yes Yes
SE-en Yes Yes Yes
SI-en Yes Yes Yes
TR-en No
UA-en No
FI-en No
LT-en No

There are a couple of minor errors in there which I'm currently addressing, but the packaged .tar.gz files are now available at http://ucrel-api-01.lancaster.ac.uk/vidler/

Once any of the remaining 7 are ready to go, let me know and I'll run them through the process 🙂

matyaskopp commented 12 months ago

@JohnVidler, the Spanish ParlaMint-ES is missing from your table (it is newly added in ParlaMint 3.1). I believe AT, ES and CZ are ready for annotation, but please wait for @TomazErjavec's confirmation.

JohnVidler commented 12 months ago

@matyaskopp Is there a plain 'ParlaMint-ES' source file? I see we have an ES-CT and ES-GA already, but I don't seem to be able to find just ES.

TomazErjavec commented 12 months ago

@JohnVidler, @matyaskopp, sorry for the silence, too many open fronts right now... First, congratulations on the number of already annotated corpora, very nice! I now put ES on https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-ES-en.conllu.zip Will let you know as others become available.

Pls. note that ParlaMint-GB should also be annotated, you can find it at https://nl.ijs.si/et/tmp/ParlaMint/MT/ParlaMint-GB.conllu.zip As discussed above, the CoNLL-U format is slightly different for this non-translated corpus, but the differences are in the metadata fields, so I hope they won't complicate your pipeline. If anything is unclear, pls. ask!

JohnVidler commented 12 months ago

@TomazErjavec No worries on the silence - I think I rather got through the existing input files faster than expected 🙂

First, congratulations on the number of already annotated corpora, very nice!

We can go faster! I'm using the workloads to test some new virtual hardware we have access to; for the curious I'm running these on a 64-core ARM AARCH64 system, with 380G of RAM and running the input files in batches of 4 languages simultaneously, but we could bump that up to 128-cores and 512GB RAM, and run in batches of 8 languages or more at once.

I now put ES on https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-ES-en.conllu.zip Will let you know as others become available.

Neither this nor the other link you just posted here seems to be accessible from this end yet. I'll poll again later to see if they're up and start the runs once they become available.

TomazErjavec commented 12 months ago

I think I rather got through the existing input files faster than expected

Indeed, Paul had much more pessimistic projections! And, given your impressive hardware, we don't have much to worry about here.

Neither this nor the other link you just posted here seems to be accessible

We seem to have problems with the network inside the institute, which is a pain for me as well, sorry about this, it happens very seldom. I hope they sort it out soon, will let you know if I notice the problem is gone.

JohnVidler commented 12 months ago

I think I rather got through the existing input files faster than expected

Indeed, Paul had much more pessimistic projections! And, given your impressive hardware, we don't have much to worry about here.

I'd rather have the performance here so we gain some time to re-run anything with issues :+1:

We seem to have problems with the network inside the institute...

No worries - I'll keep an eye on the thread here and start stuff off when the files become available.

TomazErjavec commented 11 months ago

I'll keep an eye on the thread here and start stuff off when the files become available.

Fixed, so pls. go ahead!

JohnVidler commented 11 months ago

I ran both ES-en and GB, but the parser didn't much like the format for GB. ES-en worked just fine though, and is now available at the same URL as the others. I'll dig through the logs for GB tomorrow, see what it doesn't like, and look to fix that one.

Are AT and CZ ready to go?

Here's the updated table:

Set Ready Processed Packaged
AT-en No
BA-en Yes Yes Yes
BE-en Yes Yes Yes
BG-en Yes Yes Yes
CZ-en No
DK-en Yes Yes Yes
EE-en Yes Yes Yes
ES-en Yes Yes *** Yes ***
ES-CT-en Yes Yes Yes
ES-GA-en Yes Yes Yes
FR-en Yes Yes Yes
GB Yes Failing
GR-en Yes Yes Yes
HR-en Yes Yes Yes
HU-en No
IS-en Yes Yes Yes
IT-en Yes Yes Yes
LV-en Yes Yes Yes
NL-en Yes Yes Yes
NO-en Yes Yes Yes
PL-en Yes Yes Yes
PT-en Yes Yes Yes
RS-en Yes Yes Yes
SE-en Yes Yes Yes
SI-en Yes Yes Yes
TR-en No
UA-en No
FI-en No
LT-en No

TomazErjavec commented 11 months ago

Are AT and CZ ready to go?

Yes: https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/?C=M;O=D

calzada commented 11 months ago

Excellent news for ES-en. Thanks for the good work. Mc

calzada commented 11 months ago

Btw, what do the three asterisk signs mean in ES-en? Best for now Mc

TomazErjavec commented 11 months ago

@JohnVidler, getting back to this after the hiatus with collecting the last corpora:

perayson commented 11 months ago

@TomazErjavec John doesn't have access to that folder, so he's returning files on http://ucrel-api-01.lancaster.ac.uk/vidler/ instead as mentioned above. Thanks for the AT and CZ files. Are there any more still to come through?

TomazErjavec commented 11 months ago

@TomazErjavec John doesn't have access to that folder, so he's returning files on http://ucrel-api-01.lancaster.ac.uk/vidler/ instead as mentioned above.

Thanks and sorry about this. I have now changed the download location and run my program to unpack. Unfortunately, all the files there, unlike yours, unpack to mnt/zfs/ucrel-data/, so I need to change it a bit. Will manage, I'm sure.

Thanks for the AT and CZ files. Are there any more still to come through?

Yes: HU, UA, TR, FI, but they need to be translated first, will let you know as soon as they are.

JohnVidler commented 11 months ago

Hey folks, sorry for the delay, been dealing with a major issue on another project.

Btw, what do the three asterisk signs mean in ES-en?

Just highlighting what had changed, with the table being so large - sorry, should have said :)

Unfortunately, all the files there, unlike yours, unpack to mnt/zfs/ucrel-data/, so I need to change it a bit

.. that's odd - I've got tar set to build relative paths, will look to fix this on the next build, then to repackage the existing ones with the fixed paths.

I'm still a little maxed out on the other project, but I'll see if I can get AT and CZ running overnight, along with the path fix.

TomazErjavec commented 11 months ago

Hey folks, sorry for the delay, been dealing with a major issue on another project.

Not a problem, we are busy with finishing the original language corpora anyway.

I've got tar set to build relative paths, will look to fix this on the next build

No need, I've got my unpacking set up the way it is now, so it would only mean I have to change things at my end again.

One thing: I tried running my scripts over BA and, after everything crashed, found that one of your CoNLL-U files terminates abruptly in the middle of the original file: ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu. I have no idea what exactly goes wrong with this one. At first I thought it was because of "==" in the text, which is a rather unusual combination of chars, but other files have this as well. On the assumption that other truncated files would also end with the # text = line, I made a script that checks for this, but the file above was the only one it found. So, fingers crossed that this is indeed the only bad file, but I can only really tell when I try to merge the CoNLL-U into the XML files.
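
A check of this kind could look roughly like the Python sketch below. It is not the script actually used here; it only encodes the same heuristic, that a truncated file's last non-blank line is a `# text =` comment rather than a token line. The function names are hypothetical.

```python
from pathlib import Path


def last_nonblank_line(path):
    """Return the last non-blank line of a text file ('' if there is none)."""
    last = ""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                last = line.rstrip("\n")
    return last


def looks_truncated(path):
    """Heuristic only: the file stops right after a '# text =' comment line."""
    return last_nonblank_line(path).startswith("# text =")


def find_truncated(root):
    """Return all *.conllu files under root that match the heuristic, sorted."""
    return sorted(p for p in Path(root).rglob("*.conllu") if looks_truncated(p))


if __name__ == "__main__":
    import sys
    for bad in find_truncated(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(bad)
```

A file cut off in the middle of a token block would pass this test, so a clean result does not prove that every file is complete.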

Anyway, could you re-annotate ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu please?

TomazErjavec commented 11 months ago

In addition to AT and CZ, some other MTed files are now also ready:

FI is still to come, will be finished shortly.

And, yes, I need a newly annotated ParlaMint-BA-en.conllu/2006/ParlaMint-BA-en_2006-09-18-0.conllu

TomazErjavec commented 11 months ago

FI is now also available: https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-FI-en.conllu.zip Note that we discovered in the MT process that a lot (about 7%) of the sentences are in fact in Swedish, although not marked as such. As the MT model expects Finnish, these sentences are left untranslated. I guess USAS will put the tag for unknown here.

JohnVidler commented 11 months ago

I just tried kicking off the UA and HU jobs, but the zip files seem to be malformed?

Archive:  ParlaMint-HU-en.conllu.zip
   creating: ParlaMint-HU-en.conllu/
   creating: ParlaMint-HU-en.conllu/2014/
  inflating: ParlaMint-HU-en.conllu/2014/ParlaMint-HU-en_2014-05-10.conllu
error: invalid zip file with overlapped components (possible zip bomb)

With equivalent output for the UA file.
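
For an independent check of a downloaded archive, a minimal Python sketch using the standard zipfile module is given below. It reads every member and verifies the CRCs; it does not reproduce the overlapped-components test that unzip reports above, so it is a complement to that check and not a replacement. The function name `check_zip` is hypothetical.

```python
import zipfile


def check_zip(path):
    """Return None if the archive reads cleanly, else a short problem description."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the first one with a bad CRC.
            bad_member = zf.testzip()
    except zipfile.BadZipFile as exc:
        return f"not a valid zip: {exc}"
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return None


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, check_zip(name) or "OK")
```

A partially downloaded file would typically fail this check, which is one possible explanation for an archive that opens on one machine and not on another.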

Can you take a look @TomazErjavec ?

TomazErjavec commented 11 months ago

the zip files seem to be malformed?

I just tried getting the file and unzipping it on my machine, and it works fine, also for UA and FI. Weird.

Anyway, I have now made .tgz files for FI, HU, UA, hope that will be better. Same location as before, i.e. https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/?C=M;O=D

JohnVidler commented 11 months ago

Super weird, I re-downloaded the zips to check again and they seem to be happy now?

In any case, just a quick note here to say I've been running everything at our side here, and it all looks good bar one issue where the script isn't super happy affecting a single file. I'm currently investigating that and hope to have everything published tonight/tomorrow in the web folder.

As we have a few versions of these archives kicking around now, I'm also generating an md5 for each of the files, so you can confirm you have the latest/correct version.

JohnVidler commented 11 months ago

I've processed and uploaded a new set of tars over at http://ucrel-api-01.lancaster.ac.uk/vidler/ - the only failing file is ParlaMint-BA-en_2006-09-07-0.conllu, which has a processing error (the 'Translated' field is being interpreted as 'None', which breaks the output and causes the script to crash on that one). I'll sink some time into that tomorrow and get it re-uploaded.

For now, I suggest not using the BA-en files.

There's also now an archives.md5 which includes md5 hashes for each file, to allow integrity checks, in case they're required.
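
An integrity check against such a file might look like the Python sketch below. The layout of archives.md5 is not shown in this thread, so the sketch assumes the usual md5sum format (one `<hash>  <filename>` pair per line) and that the archives sit next to the manifest; both are assumptions, as is the function name `verify`.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest_path):
    """Return the names of files that are missing or whose MD5 does not match."""
    manifest = Path(manifest_path)
    failed = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum prefixes the name with '*' in binary mode
        target = manifest.parent / name
        if not target.is_file() or md5_of(target) != expected.lower():
            failed.append(name)
    return failed


if __name__ == "__main__":
    import sys
    problems = verify(sys.argv[1] if len(sys.argv) > 1 else "archives.md5")
    print("all files match" if not problems else f"mismatch or missing: {problems}")
```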

I've fixed the tar enclosed paths @TomazErjavec so I'm afraid your programs will need reverting to their previous paths - it was a bug that needed fixing as any changes to our build system here would mean a different path in the resultant .tar - sorry!

Also, according to my load tests - I can now apparently rebuild this lot over about a 24-48hr period no problem at all 🙂

TomazErjavec commented 11 months ago

Thanks @JohnVidler, got the missing files. I don't see any particular need to use the checksums, unless something freaky happens again. And good to hear that speed is not an issue. So:

JohnVidler commented 11 months ago

No problem, I'll set ES-CT going in a moment, and I'll be looking at GA and GB today, so they should be up shortly, barring any major issue

Edit: Ah, GB got missed because it didn't follow the 'XX-en' pattern I was using to automatically download everything - whoops. Getting that started too.
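
As an illustration of how a download filter could accept both the translated 'XX-en' corpora and the untranslated GB one, here is a small Python sketch. The pattern is hypothetical and is not the one used in the actual download script; it is modelled on the archive names posted in this thread, and the .tgz extension is an assumption.

```python
import re

# Two-letter country code, optional two-letter region (e.g. ES-CT),
# optional '-en' suffix for machine-translated corpora.
CORPUS_RE = re.compile(
    r"^ParlaMint-([A-Z]{2}(?:-[A-Z]{2})?)(-en)?\.conllu\.(?:zip|tgz)$"
)


def corpus_code(filename):
    """Return the corpus code (e.g. 'ES-CT-en' or 'GB'), or None if no match."""
    match = CORPUS_RE.match(filename)
    if match is None:
        return None
    return match.group(1) + (match.group(2) or "")


if __name__ == "__main__":
    for name in ("ParlaMint-ES-CT-en.conllu.zip", "ParlaMint-GB.conllu.zip"):
        print(name, "->", corpus_code(name))
```

Making the '-en' suffix optional, instead of requiring it, is what lets ParlaMint-GB.conllu.zip through.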

TomazErjavec commented 11 months ago

No problem, I'll set ES-CT going in a moment, and I'll be looking at GA and GB today, so they should be up shortly, barring any major issue Edit: Ah, GB got missed because it didn't follow the 'XX-en' pattern I was using to automatically download everything - whoops. Getting that started too.

@JohnVidler, any news on GB and ES-CT? As well as on the missing BA file?

And we now also got the last corpus translated, if you could process this one as well please: https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ParlaMint-ES-PV-en.conllu.zip

JohnVidler commented 10 months ago

Hey @TomazErjavec - disruption to my working pattern slowed me down - ES-CT is now in the usual spot: ( http://ucrel-api-01.lancaster.ac.uk/vidler/ ) GB is mostly playing well with the tooling here now, but I've got a couple of errors I'm fixing today so that should be up shortly.

I'll kick off ES-PV now and it'll be up for tomorrow morning, assuming we hit no problems.

TomazErjavec commented 10 months ago

Thanks @JohnVidler for ES-CT. Some problems, because a) it seems ES-CT actually deleted some files from the first round and b) you seem to have expanded the new corpus into the old directory, so those files persisted there. The result was havoc with my integration program, but managed to identify the spurious files and delete them, so now all ok. And looking forward to the final corpora!

TomazErjavec commented 10 months ago

Heads up @JohnVidler, we are now running very late. We still need:

We would really need to release ParlaMint-en soon, and the processing at this end takes some time too (assuming no problems, otherwise even more...)

JohnVidler commented 10 months ago

@TomazErjavec

GB to follow shortly, apologies for the delay!

JohnVidler commented 10 months ago

Note that GB has rather large log files, as the tooling repeatedly complains about the missing sources - I've left these in for now as the output still needs to be sanity checked by @perayson, but I've uploaded the .tar.gz anyway so you can get started @TomazErjavec on the assumption that all is well.

If the log size is a problem, let me know and I'll strip the warnings out and re-upload.

TomazErjavec commented 10 months ago

@JohnVidler, thanks for the corpora. Got all 3 of them, and at first glance they look ok (but I do have to comment on the inventiveness of the paths, ES-PV and BA in mnt/zfs/ucrel-data, and GB in home/ubuntu/:). Logs are not a problem. But if @perayson finds problems with GB pls. let me know before I start processing it. And thanks for your work!

perayson commented 10 months ago

I've had a look at GB this morning, the semantic tagging looks fine, however we never really agreed an input/output format for GB as it's different from the translated corpora. Can you have a look @TomazErjavec and let us know what else needs to be retained, if anything, from the input?

JohnVidler commented 10 months ago

Argh, apologies for the path mixup - I had to run GB on its own, hence the different path, but the darned version of tar on the box seems to ignore the -C directive to handle non-absolute paths. I've rebuilt the files and can update the published ones with the corrected paths if this is simpler.
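
One way to make the enclosed paths independent of the build machine is to set the archive member names explicitly when packaging. The Python sketch below does this with the standard tarfile module; it is an assumed alternative to the tar command line and not the build system used here, and the function names are hypothetical.

```python
import tarfile
from pathlib import Path


def pack_relative(src_dir, archive_path):
    """Pack src_dir into a .tar.gz so that members start at the directory's own name."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname replaces the on-disk path, so parent directories such as
        # the build machine's mount point never end up in the archive.
        tar.add(src, arcname=src.name)


def member_names(archive_path):
    """List the member paths stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    import sys
    pack_relative(sys.argv[1], sys.argv[2])
    print("\n".join(member_names(sys.argv[2])))
```

Listing the member names after packaging gives a quick way to confirm that no machine-specific prefix has crept in before the archive is published.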

TomazErjavec commented 10 months ago

I've rebuilt the files and can update the published ones with the corrected paths if this is simpler.

No @JohnVidler, it's ok, I have all the files now the way I want them here. But I thought I should mention it!

I've had a look at GB this morning, the semantic tagging looks fine, however we never really agreed an input/output format for GB as it's different from the translated corpora. Can you have a look @TomazErjavec and let us know what else needs to be retained, if anything, from the input?

@perayson, I think it's ok the way it is. I did some pre-processing and nothing broke. So, I think we can consider the delivery of all the files done! (well, except if some later stage of processing, in particular the conversion into TEI, shows some unexpected problems, but I am optimistic that it won't).

perayson commented 10 months ago

ok, great, thanks for confirming!

TomazErjavec commented 10 months ago

The points here have been mostly solved, what remains should be taken up in #827.