bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create dataset loader for Chemical exposure assessments #244

Open jason-fries opened 2 years ago

jason-fries commented 2 years ago

Adding a Dataset

debajyotidatta commented 2 years ago

self-assign

nbroad1881 commented 2 years ago

I'm not really sure how to process this one. corpus.zip contains two folders, class and txt. Here are the contents of 7481741.txt in class and in txt, respectively:

<exposure routes--oral intake--food--<exposure routes--oral intake--food--<exposure routes--oral intake--food--< Biomonitoring--exposure biomarker--hair nail--< Biomonitoring--exposure biomarker--hair nail--<

A study was conducted to examine human exposure to mercury through dietary mercury intake in a population living in an industrially non-polluted area of the Adriatic Sea . The results have shown that approximately 20% of the subjects had a weekly dietary mercury intake above the provisional tolerable weekly intake ( PTWI ) , primarily those consuming fish and other seafood > 6 times/week . The estimated seafood consumption corresponding to a mean intake of PTWI of 300 micrograms total mercury was 1559 g , and 1365 g for a PTWI of 200 micrograms methylmercury . However , the total mercury content in hair in individuals consuming total mercury above the PTWI was in the range of 1.3-12.9 micrograms/g , whereas the methylmercury content in hair in subjects consuming methylmercury above the PTWI was between 1.1-10.8 micrograms . Thus , the mercury content in hair did not reach the critical level at which toxic effects of mercury could be expected . The results , particularly those related to methylmercury exposure , did not differ significantly from data reported earlier from an industrially polluted area , thus indicating that the mercury content of fish and consequent human exposure to mercury reflects primarily the general ecological characteristics of the Adriatic , rather than the impact upon a specific local pollution .
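One possible reading of the class file, as a minimal sketch: it assumes `<` separates label entries and `--` separates the levels of a hierarchical label. Neither assumption is confirmed by the corpus documentation, and the function name `parse_class_labels` is hypothetical.

```python
from typing import List, Tuple


def parse_class_labels(raw: str) -> List[Tuple[str, ...]]:
    """Split a raw class string into hierarchical label paths.

    Assumption (not confirmed by the corpus docs): '<' separates
    entries and '--' separates the levels inside one entry.
    """
    paths = []
    for chunk in raw.split("<"):
        # Drop empty pieces left by leading/trailing separators.
        levels = tuple(p.strip() for p in chunk.split("--") if p.strip())
        if levels:
            paths.append(levels)
    return paths


raw = (
    "<exposure routes--oral intake--food--"
    "< Biomonitoring--exposure biomarker--hair nail--<"
)
for path in parse_class_labels(raw):
    print(" > ".join(path))
```

Whether each entry corresponds to one sentence of the abstract is unclear from this example alone: the file above has five entries, while the abstract has six sentences.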

The corpus_preprocessed.zip file contains several folders (NER, pos, lem, parse). Here is 7481741.ner.txt:

PROTEIN: Adriatic Sea ~~~ ~~~ ~~~ PROTEIN: PTWI ~~~ ~~~ ~~~

Here is the beginning of 7481741.pos.txt:

A DT study NN was VBD conducted VBN to TO examine VB human JJ exposure NN to TO mercury NN through IN dietary JJ mercury NN intake NN in IN a DT population NN
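The POS output above looks like alternating token and tag fields. A minimal sketch for reading it, assuming the fields are whitespace-separated and no token contains internal whitespace (the function name `parse_pos` is hypothetical):

```python
from typing import List, Tuple


def parse_pos(raw: str) -> List[Tuple[str, str]]:
    """Pair up alternating token/tag fields from a .pos.txt file.

    Assumption: fields are whitespace-separated and strictly
    alternate token, tag, token, tag, ...
    """
    fields = raw.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of fields (token/tag pairs)")
    # Even positions are tokens, odd positions are their tags.
    return list(zip(fields[0::2], fields[1::2]))


pairs = parse_pos("A DT study NN was VBD conducted VBN")
print(pairs)
```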

And the beginning of 7481741.parse.txt:

(det study_1 A_0) (ncmod exposure_7 human_6) (dobj to_8 mercury_9) (ncmod exposure_7 to_8) (dobj examine_5 exposure_7) (ncmod intake_13 mercury_12) (ncmod intake_13 dietary_11) (dobj through_10 intake_13) (ncmod examine_5 through_10) (det population_16 a_15) (ncmod non-polluted_21 industrially_20) (ncmod area_22 non-polluted_21)
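The parse output appears to be a list of grammatical relations of the form `(label word_index word_index)`. A minimal sketch for extracting them with a regular expression follows. It assumes every relation has exactly two arguments and that the first is the head and the second the dependent. Both are guesses from this excerpt, and the names `Relation` and `parse_relations` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Relation(NamedTuple):
    label: str
    head: str
    head_index: int
    dependent: str
    dependent_index: int


# Assumption: binary relations only, each argument written as word_index.
_RELATION = re.compile(r"\((\w+) (\S+)_(\d+) (\S+)_(\d+)\)")


def parse_relations(raw: str) -> List[Relation]:
    """Extract (label, head, dependent) relations from a .parse.txt string."""
    return [
        Relation(label, head, int(h_idx), dep, int(d_idx))
        for label, head, h_idx, dep, d_idx in _RELATION.findall(raw)
    ]


raw = "(ncmod exposure_7 human_6) (ncmod area_22 non-polluted_21)"
for rel in parse_relations(raw):
    print(rel)
```

The token indices are zero-based positions in the tokenized abstract, so they can be joined back to the POS and lemma files if needed.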

And the beginning of 7481741.lem.txt:

A DT a study NN study was VBD be conducted VBN conduct to TO to examine VB examine human JJ human exposure NN exposure to TO to mercury NN mercury through IN through dietary JJ dietary mercury NN mercury intake NN intake

Any suggestions?

hakunanatasha commented 2 years ago

@nbroad1881 can you ping the discord?

nbroad1881 commented 2 years ago

@leonweber I don't plan on implementing this. I was just exploring it.

leonweber commented 2 years ago

Oh sorry, my bad!