HazyResearch / bootleg

Self-Supervision for Named Entity Disambiguation at the Tail
http://hazyresearch.stanford.edu/bootleg
Apache License 2.0
214 stars 27 forks

Sentence formatting and tokenisation #19

Closed antonyscerri closed 3 years ago

antonyscerri commented 3 years ago

Hi

Could you provide a little detail on the pre-tokenisation that appears to have been done on the sentence text in the jsonl files? I couldn't find any mention of it, or code that appears to deal with that step. Feeding in a raw paragraph produced a key error in the initial entity mention detection step, caused by an escaped newline in the text. After removing the newline it got past that step, but then seemed to fail when merging subsentences; I tried adding spaces between various tokens, which still resulted in "AssertionError: Sent -1, Al -1".

I noted there was code to window over the "sentence"; however, anything over a certain length seems to cause it to fail. Otherwise the text appears to have punctuation split off and everything lowercased, but I would like to know if there is anything else to take care of.

Could you also comment on how it might react to being given longer passages of text (for predictions against your models, or for future training)?

Thanks

Tony

lorr1 commented 3 years ago

Hello Tony!

Happy to help here. All of our data is preprocessed from Wikipedia using nltk word tokenization and then joined back with a single space.

E.g., if sent is a single sentence without an ending period, we run sent = " ".join(nltk.word_tokenize(sent)) + " ."
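For concreteness, here is a minimal sketch of that preprocessing applied to a raw paragraph. It is only a sketch: the sentence-splitting step is an assumption (it comes up later in this thread), and the period handling just follows the rule above.

```python
import nltk

# One-time download of the punkt tokenizer models (skip if already installed).
nltk.download("punkt")

def preprocess_paragraph(paragraph):
    """Split a raw paragraph into sentences and format each one the way the
    training data was: nltk word tokens joined by single spaces, with " ."
    appended when the sentence does not already end in a period."""
    formatted = []
    for sent in nltk.sent_tokenize(paragraph):
        joined = " ".join(nltk.word_tokenize(sent))
        if not joined.endswith("."):
            joined = joined + " ."
        formatted.append(joined)
    return formatted

# Example: a sentence without terminal punctuation gets " ." appended.
print(preprocess_paragraph("who did the voice of the magician in frosty the snowman"))
```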

It'd help if I could see an example of the paragraph you're trying to pass through. Can you share that with me? My guess is there is a subtle punctuation issue that's causing problems. If I can take a look, I can push a fix through or help you process it in a way that works.

antonyscerri commented 3 years ago

Hey

Thanks for clarifying that. I assume that, starting with a passage of text, we should use NLTK to sentence-split as well.

I checked my hand-formatted example against NLTK. So the following works:

{"sent_idx_unq": "0", "sentence": "bisphenol a removal by the halophyte juncus acutus in a phytoremediation pilot characterization and potential role of the endophytic community"}

If I then add a few more words (a random bit of text), it fails with the error we were seeing:

{"sent_idx_unq": "0", "sentence": "bisphenol a removal by the halophyte juncus acutus in a phytoremediation pilot characterization and potential role of the endophytic community in space and contaminated groundwater"}

Let me know if that reproduces the issue on your end.

Thanks

Tony

antonyscerri commented 3 years ago

Sorry, me again: the NLTK word tokenizer doesn't output lowercased text, but in the 50-sentence wiki example the sentences all appear lowercase and without terminating punctuation. Should I do some additional stripping?

antonyscerri commented 3 years ago

A couple more comments on the input file:

1) sent_idx_unq looks like it must be a numeric value, zero-indexed and contiguous (I believe this explains a few KeyErrors I'd seen)
2) It also looks like if you have a single sentence and it doesn't have an entity mention, then you get an empty file which it fails to mmap later on

I've been iterating over a small sentence trying to work out why it is failing, and those are just some observations from trying different things. I found that taking just one of the examples from the wiki 50 samples also causes a problem. So putting the following in on its own causes a problem, with a slightly different error to the other single-sentence examples I've tried.

{"sentence": "who did the voice of the magician in frosty the snowman", "sent_idx_unq": 0, "aliases": ["frosty the snowman"], "spans": [[8, 11]], "qids": ["Q5506238"], "gold": [true]}

Are there any issues with some data being cached that might need cleaning out?

Thanks

lorr1 commented 3 years ago

Hello!

To answer your questions, you don't need to have lowercased text or text without punctuation. The nq data is a small sample from Natural Questions, and that data happens to already be lowercased and without punctuation. As our model is trained over Wikipedia, it can handle cased words.

And I don't think sent_idx_unq needs to be contiguous from 0; you can take a random sample of a dataset and run eval without a problem. The values do need to be unique from each other, though, otherwise it can throw errors.

I did find a bug in the load_mentions function due to the empty mentions. Thanks for catching that. I'll push a fix soon.
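To make those two constraints concrete, here is a minimal sketch of a sanity check one could run over an input file. The field names follow the examples in this thread, and the empty-mention check only applies to a file that has already been through mention extraction (i.e. has an "aliases" field); none of this is Bootleg API, just plain Python.

```python
import json

def check_jsonl(path):
    """Sketch of a sanity check: sent_idx_unq values should be unique, and
    (after mention extraction) each sentence should have at least one alias."""
    seen = set()
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            obj = json.loads(line)
            idx = obj["sent_idx_unq"]
            if idx in seen:
                print(f"line {line_no}: duplicate sent_idx_unq {idx!r}")
            seen.add(idx)
            if "aliases" in obj and len(obj["aliases"]) == 0:
                print(f"line {line_no}: sent_idx_unq {idx!r} has no extracted mentions")

# Example usage with a hypothetical filename.
check_jsonl("sentences.jsonl")
```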

I have had some issues recreating your bug with the sentences. So, I made a small file with the following three sentences:

{"sent_idx_unq": "0", "sentence": "bisphenol a removal by the halophyte juncus acutus in a phytoremediation pilot characterization and potential role of the endophytic community in space and contaminated groundwater"}

{"sent_idx_unq": "1", "sentence": "bisphenol a removal by the halophyte juncus acutus in a phytoremediation pilot characterization and potential role of the endophytic community"}

{"sent_idx_unq": "2", "sentence": "bldsslkdghsadkjf"}

I then call extract_mentions and run the model with mode dump_preds. The output label file has all three sentences (the last one having no mentions). I haven't gotten any errors doing this.
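For anyone trying to reproduce this, here is a quick sketch of writing that test file as jsonl; the contents are exactly the three sentences above and the filename is arbitrary.

```python
import json

# The three test sentences from above, each with a unique sent_idx_unq.
sentences = [
    "bisphenol a removal by the halophyte juncus acutus in a phytoremediation pilot characterization and potential role of the endophytic community in space and contaminated groundwater",
    "bisphenol a removal by the halophyte juncus acutus in a phytoremediation pilot characterization and potential role of the endophytic community",
    "bldsslkdghsadkjf",
]

# Write one JSON object per line; "test_sentences.jsonl" is an arbitrary filename.
with open("test_sentences.jsonl", "w") as f:
    for idx, sent in enumerate(sentences):
        f.write(json.dumps({"sent_idx_unq": str(idx), "sentence": sent}) + "\n")
```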

Can you explain in more detail the steps you are going through?

antonyscerri commented 3 years ago

hey

Thanks for checking. We got another machine set up to check, and that seemed to work OK, so it looks like it may be something in the environment causing it; I will be going through that to check. Thanks also for letting me know about the casing; I just wanted to make sure in case it affected the results.

I will send an update if I trace what is wrong (there was an issue with disk space at one point during the install, so maybe something got corrupted, although I did start over). On a side note, the Annotator class was able to process the sentence I was having problems with, so it seems like it should work.

Thanks

Tony

antonyscerri commented 3 years ago

Hi

OK, I got another environment going and then got it to produce an error which led me (hopefully) to the answer. There is a set of intermediate files produced which don't get rebuilt, even if you edit the input file. So the sequence to reproduce the error is:

1) Create sample with short sentence (first sample in my second post above should do)
2) Run the end to end using that file
3) Edit the file changing the sentence to the longer form (second example in my second post above)
4) Re-run the end to end and it should produce the error

If, between steps 2 and 4, you delete the "prep" subfolder in the folder where the input file is located, it will be OK. I believe this may be part of the dataloader setup you are using, as I had some debug output in there which I wasn't seeing in the log until I moved environments and began running different input filenames, and then noticed it.
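To make the workaround concrete, here is a minimal sketch of clearing the cached prep files between runs. The "prep" directory name comes from the observation above; the input-file path is just a placeholder.

```python
import os
import shutil

# Path to the input jsonl (placeholder: adjust to your own file).
input_file = "data/sentences.jsonl"

# The cached intermediate files live in a "prep" subfolder next to the input file.
prep_dir = os.path.join(os.path.dirname(input_file), "prep")

# Remove the stale cache so the edited input gets re-prepped on the next run.
if os.path.isdir(prep_dir):
    shutil.rmtree(prep_dir)
    print(f"Removed stale prep cache: {prep_dir}")
```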

So this may not be a true bug, just undocumented behaviour. It is worth noting, though, that I had been editing the input file without noticing the assertion error, which raises the question of whether the outputs would actually be correct if it is reusing these files.

Tony

antonyscerri commented 3 years ago

Just noticed that you do have a warning in the basic training guide which matches my last comment; it may want highlighting elsewhere.

lorr1 commented 3 years ago

This is great feedback. Thank you! Yes, we store prepped files for faster loading. We will highlight this more in the tutorial/docs and instructions so it is clearer what is happening. You can turn on an overwrite flag, but it's not on by default.

Let me know if you run into any other issues! I'll push a PR with updated instructions in the next day or two.

antonyscerri commented 3 years ago

Good to know about the overwrite flag as well.

Thanks


antonyscerri commented 3 years ago

Quick note on the overwrite flag: if this is "overwrite_preprocessed_data", then it seems to also reprocess a bunch of the wiki entity-related data files, which I'm not sure really need to be touched?

lorr1 commented 3 years ago

Yes, this is correct. The flag overwrites and reprocesses all prepped data. We felt most users wouldn't need fine-grained control over what gets re-prepped, but if it is a common use case, we can certainly add more fine-grained "reprep" flags.

Also, we merged the mention extraction bug fix and added better instructions in our tutorial around when data needs to be prepped (https://github.com/HazyResearch/bootleg/pull/21).