EsmaeilNourani / lifestylefactors-annotation-docs

Apache License 2.0

Consolidated table of the manually annotated disease-LSF relations in LSD600 #1

Closed dhimmel closed 1 week ago

dhimmel commented 1 month ago

Greetings, I'm reviewing the preprint LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations. I am excited to look at the 1900 disease to lifestyle factor relationships as per:

The 600 abstracts that make up the corpus were randomly partitioned on the document-level into a training set (60%), a development set (20%), and a held-out test set (20%). The corpus contains a total of 1900 manually annotated relations, which are distributed over eight relation types that are organized into a hierarchy.

From the Zenodo deposit, I downloaded LSD600.tar.gz and began looking at the .txt files with abstracts and the .ann files, which appear to be in BRAT standoff format and contain annotated named entities and relations.

Many of the early .ann files only contained entities and no relationships, but I stumbled upon 34621627.ann containing both. Snippet below:

T1	Disease 13 26	Skin Diseases
T16	Lifestyle_factor 2167 2172	shift
T17	Disease 2198 2211	skin diseases
T18	Lifestyle_factor 1905 1908	PPE
T25	Disease 248 269	occupational diseases
T31	Out-of-scope 438 441	may
T32	Lifestyle_factor 1881 1897	safety trainings
R1	negative_statistical_association Arg1:T18 Arg2:T13
T35	Lifestyle_factor 2135 2150	safety training
R6	negative_statistical_association Arg1:T35 Arg2:T17
T36	Lifestyle_factor 2307 2313	gloves
R7	Prevents Arg1:T19 Arg2:T17
R8	Prevents Arg1:T36 Arg2:T17

This preamble is for orientation and to demonstrate that it's not easy for the reader to manually inspect all 1900 relations.

My suggestion is to create a consolidated dataset with one record per manually curated relation across all 600 abstracts. This would not replace the txt/ann files but would be a useful access point for users to immediately inspect the data. Suggested fields are:

JSON, TSV, or Excel would all be reasonable format choices for this. Column names are greatly appreciated!

This would help me get a much better appreciation for the types and quality of relationships annotated by LSD600.
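
For readers who want to try this kind of consolidation themselves, here is a minimal sketch in Python. It is not the authors' script. It assumes the .ann files follow the BRAT standoff layout seen in the snippet above (T lines for entities, R lines for binary relations with Arg1 and Arg2), and the column names in COLUMNS are hypothetical placeholders rather than the schema of any released table. Entities with discontinuous spans are skipped, and a relation argument whose entity is not found is left blank.

```python
import csv
from pathlib import Path

# Hypothetical column names, chosen for illustration only.
COLUMNS = ["doc_id", "relation_id", "relation_type",
           "arg1_id", "arg1_type", "arg1_text",
           "arg2_id", "arg2_type", "arg2_text"]


def parse_ann(ann_text):
    """Split BRAT standoff text into an entity dict and a relation list."""
    entities, relations = {}, []
    for line in ann_text.splitlines():
        fields = line.strip().split(None, 1)
        if len(fields) < 2:
            continue
        ann_id, rest = fields
        if ann_id.startswith("T"):
            # Entity line: "<type> <start> <end> <covered text>".
            # Discontinuous spans (offsets containing ";") are skipped.
            parts = rest.split(None, 3)
            if len(parts) == 4 and parts[1].isdigit() and parts[2].isdigit():
                entities[ann_id] = {"type": parts[0], "text": parts[3]}
        elif ann_id.startswith("R"):
            # Relation line: "<type> Arg1:<id> Arg2:<id>".
            parts = rest.split()
            args = dict(p.split(":", 1) for p in parts[1:] if ":" in p)
            relations.append(
                (ann_id, parts[0], args.get("Arg1", ""), args.get("Arg2", ""))
            )
    return entities, relations


def relation_rows(doc_id, ann_text):
    """Return one dict per relation, with entity type and text resolved."""
    entities, relations = parse_ann(ann_text)
    unknown = {"type": "", "text": ""}  # used when an argument id is not found
    rows = []
    for rel_id, rel_type, arg1, arg2 in relations:
        e1 = entities.get(arg1, unknown)
        e2 = entities.get(arg2, unknown)
        rows.append({
            "doc_id": doc_id, "relation_id": rel_id, "relation_type": rel_type,
            "arg1_id": arg1, "arg1_type": e1["type"], "arg1_text": e1["text"],
            "arg2_id": arg2, "arg2_type": e2["type"], "arg2_text": e2["text"],
        })
    return rows


def write_consolidated_tsv(ann_dir, out_path):
    """Walk ann_dir for .ann files and write every relation to one TSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        for ann_file in sorted(Path(ann_dir).rglob("*.ann")):
            text = ann_file.read_text(encoding="utf-8")
            writer.writerows(relation_rows(ann_file.stem, text))
```

Calling write_consolidated_tsv on the extracted archive directory would give one row per relation, with the file stem (for example 34621627) as the document identifier.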

dhimmel commented 1 month ago

Just wanted to note that Figure 1 is a nice graphical representation of these relationships:

Illustration of the eight LSF–disease relation types in LSD600.

The dataset would offer more fields and access to all 1900 relations.

EsmaeilNourani commented 4 weeks ago

Thank you for your valuable input!

The requested table is now available. You can download it from the following link:

🔗 Consolidated Relations Dataset

The table includes the following columns:

Feel free to reach out if you have any further suggestions or questions.

dhimmel commented 4 weeks ago

Thanks @EsmaeilNourani. Very helpful.

I see the data was added in https://github.com/EsmaeilNourani/LSF_Disease_RE/commit/9895a5e0ffc6d2eddc349204750fc702a95790da. Would it make sense to also commit the code that generates this data?

dhimmel commented 3 weeks ago

The contents of Consolidated_Relations_Dataset.tsv look good to me. Nicely executed.

It would be ideal to archive this table in the Zenodo deposit so it lives with the rest of the data release.

Once code is available to generate this dataset and the data is on Zenodo, I will close this issue to denote that the request has been resolved. Same applies to https://github.com/EsmaeilNourani/lifestylefactors-annotation-docs/issues/2.

dhimmel commented 3 weeks ago

FYI the full text of my review is online here. Thanks for posting a preprint such that it is possible for me to immediately share my review. Cheers.

dhimmel commented 1 week ago

Consolidated_Relations_Dataset.tsv is on Zenodo and the contents look good. The code to generate it is at helpers/create_consolidated_relations.py. Thanks @EsmaeilNourani