Rostlab / LocText

Relation Extraction (RE) of: Proteins <--> Cell Compartments
https://www.tagtog.net/-corpora/LocText
Apache License 2.0

‼️Fill in missing normalizations in interFile_modified.tsv file #27

Closed juanmirocks closed 7 years ago

juanmirocks commented 7 years ago

Dear Tanya,

I’m getting much closer wrt LocText.

Only, I found that many documents (39 out of the total 100) had proteins that were not normalized to UniProt IDs. At first I thought it was only a few cases, so I was ignoring it. But now I see that this is affecting performance quite considerably.

I manually checked a few documents and it seems that most proteins can indeed be normalized.

Would it be possible for you to please fill in the missing normalization ids?

You would simply do it in the attached file (interFile_modified). You can use the tagtog interface to help you read the documents: https://www.tagtog.net/jmcejuela/LocText/pool

Proteins that have no normalization in the .tsv have only “Protein” written in the corresponding NormalizedID column.

It’s possible to find the set of the pmids docs with:

grep "Protein\s*Protein" interFile_modified.tsv | grep -o "^[0-9]*\s" | sort | uniq
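For anyone who prefers Python over the shell, the same lookup can be sketched as below. This is only an illustration: it assumes the file is tab-separated with the columns in the order pmid, entity text, entity type, NormalizedID, section, start, end (as in the record `11502169 Sdh1p Protein Q00711 abs 786 791` quoted later in this thread), and the function name is made up for this example.

```python
def pmids_with_unnormalized_proteins(lines):
    """Return the sorted PMIDs that have at least one protein row whose
    NormalizedID column contains only the placeholder "Protein".

    Assumed (not confirmed) column order, tab-separated:
    pmid, entity text, entity type, NormalizedID, section, start, end
    """
    pmids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        pmid, _text, entity_type, normalized_id = fields[:4]
        if entity_type == "Protein" and normalized_id == "Protein":
            pmids.add(pmid)
    return sorted(pmids)
```

It could be called with an open file handle, e.g. `pmids_with_unnormalized_proteins(open("interFile_modified.tsv", encoding="utf-8"))`.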

In particular, on tagtog you can search a pmid by searching on top, e.g., docid:10072396

Would you be able to do this, please?

The tasks to be completed as part of this issue

Necessary information for this task:

  1. The folder containing the latest documents: LocText_annjson_with_normalizations_latest_5_feb_2017
  2. The file containing the latest changes from Tanya: interFile_modified_by_Tanya
  3. The latest code has been checked into git under the 8_StringTaggerUniprot branch.

goldbergtatyana commented 7 years ago

I am afraid that normalization of the remaining proteins is not possible. I checked the first 15 proteins, and this is the case for them. Normalization is not possible because, for example:

  1. The protein name is too generic (e.g. Ferrochelatase from Arabidopsis, pmid: 9346891; the protein has different subforms, so it's unclear which subform is meant here).
  2. The protein is not in Swiss-Prot (e.g. EDTA from pea, pmid: 9346891).

goldbergtatyana commented 7 years ago

We said in the rules that proteins like "green fluorescent protein" or "GFP" won't be related to an organism. So, they stay unnormalized as well.

goldbergtatyana commented 7 years ago

It is also the case that the proteins in some abstracts don't have any relation to an organism => no normalization possible. Example: pmid 11038182

juanmirocks commented 7 years ago

@goldbergtatyana awesome, thanks for this!

I checked two documents at random, and at first sight it appeared to me that they can indeed be normalized.

Could you please share the exact list of 15 documents? Then I can test the others a bit more thoroughly.

P.S.: good job hustling on Sunday

goldbergtatyana commented 7 years ago

no_norm.xlsx

goldbergtatyana commented 7 years ago

Now I'm through the whole list and here is what comes out of it. Red marks all proteins for which no normalization can be made. Yellow: normalization could in principle be made, but I need to know which mentions the coordinates match. Green: normalization was made.

interFile_modified_tg_dec182016.xlsx (https://github.com/juanmirocks/LocText/files/659536/interFile_modified_tg_dec182016.xlsx)

juanmirocks commented 7 years ago

Awesome. I check the yellow tmrw. Good night darling

juanmirocks commented 7 years ago

Hi there,

juanmirocks commented 7 years ago

@goldbergtatyana Perhaps you can quickly skim over the tagtog documents that have non-normalized proteins.

In my view, most of them are normalizable.

Btw, you can use the shortcuts [ and ] to navigate to the previous and next documents, respectively 😀

juanmirocks commented 7 years ago

Furthermore, other documents that really do not cite any organism likely implicitly assume human.

Here for example, https://www.tagtog.net/jmcejuela/LocText/-search/origid:15342390/apR4tcyiJeOQUQKqEhe_LAUHRAYq-15342390?p=0&i=0

For instance, I guess that "CITED4" is this human protein: http://www.uniprot.org/uniprot/Q99MA0

I don't remember what we said in the guidelines regarding the normalization of these proteins. But it would not be far off to assume that for many of these cases the missing organism is human.

goldbergtatyana commented 7 years ago

hi Mishka,

I looked into the shugoshin paper (https://www.researchgate.net/publication/7237704_Shugoshin_collaborates_with_protein_phosphatase_2A_to_protect_cohesin) and found that they talk about both versions of the protein. Therefore, assigning only the ID of protein version 1 would be wrong in my opinion. Also, I think going into the paper to figure out which protein the abstract is about is out of scope for the annotation.

goldbergtatyana commented 7 years ago

Now to CITED4: I looked into the paper on that protein as well, and it talks about both the human and the mouse version of it. So, blindly assigning human to non-normalized proteins would be wrong in my opinion as well.

goldbergtatyana commented 7 years ago

The proteins marked yellow have now been normalized, and the newest version of the normalization file is attached: interFile_modified_tg_jan052017.xlsx

juanmirocks commented 7 years ago

Yes, I agree it's out of scope reading the full paper.

However, I think the protein can still be normalized, either to one arbitrary version (I assume they have high sequence similarity, so the choice is irrelevant for the final comparison) or, even better, to both, separating the IDs with commas.

goldbergtatyana commented 7 years ago

I disagree with this brute force normalization approach. What is the goal of everything? Why are we enforcing normalization this way that will certainly introduce errors?

juanmirocks commented 7 years ago

In the end it may not introduce errors, since Lars's tagger produces multiple normalization alternatives as output.

In any case, I agree with you. Let's hold off on annotating these cases until I get more results. The effect may indeed be negligible.

Thanks for annotating the yellow cases!

MadhukarSP commented 7 years ago

Reference link : https://github.com/juanmirocks/LocText/blob/develop/resources/corpora/LocText/interFile_modified.tsv

The link above points to interFile_modified.tsv. As per this issue, we are supposed to get the NormalizedID for the proteins that are currently marked red.

These NormalizedIDs can also be filled in by using StringTagger, with the following steps:

  1. Send the entity name as text and the type as Human (in case we assume that the organism is human), and receive the JSON response with the corresponding normalized STRING ID and UniProt ID.
  2. With the UniProt ID from the response, fill in the corresponding cells in the interFile_modified.tsv document.
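The fill-in step could look roughly like the sketch below. It deliberately does not call StringTagger: `lookup` is a hypothetical callable standing in for whatever request/response handling is eventually used, taking an entity name and returning a UniProt ID or `None`. The column order (pmid, entity text, entity type, NormalizedID, ...) is also an assumption based on the records quoted in this thread.

```python
def fill_missing_ids(rows, lookup):
    """Return a copy of rows with placeholder NormalizedIDs filled in.

    rows:   list of lists, assumed layout
            [pmid, entity text, entity type, NormalizedID, ...]
    lookup: hypothetical callable (entity text -> UniProt ID or None);
            it stands in for the tagger call and is NOT a real API.
    """
    filled = []
    for row in rows:
        row = list(row)  # copy, so the input rows are left untouched
        is_unnormalized = (
            len(row) >= 4 and row[2] == "Protein" and row[3] == "Protein"
        )
        if is_unnormalized:
            uniprot_id = lookup(row[1])
            if uniprot_id:  # keep the placeholder when nothing is found
                row[3] = uniprot_id
        filled.append(row)
    return filled
```

Rows for which the lookup returns nothing keep the "Protein" placeholder, so they remain easy to find afterwards.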

juanmirocks commented 7 years ago

I'll let you know soon. JSON-LD may suffice.

MadhukarSP commented 7 years ago

The tasks to be completed:

  1. Use Tanya's new document with the green and red ones.
  2. Create a new folder for the new documents in resources.
  3. Add the new interFile_modified_by_Tanya.tsv.
  4. Change read corpus to the new path.

MadhukarSP commented 7 years ago

Task complete.

Necessary Information wrt to this task:

  1. The folder containing the latest documents: LocText_annjson_with_normalizations_latest_5_feb_2017
  2. The file containing the latest changes from Tanya: interFile_modified_by_Tanya
  3. The latest code has been checked into git under the 8_StringTaggerUniprot branch.

MadhukarSP commented 7 years ago

For our records:

The xlsx file from Tanya had a minor mistake. One of the records that had a UniProt ID was replaced with the string Protein. The record is below.

The original record was: 11502169 Sdh1p Protein Q00711 abs 786 791

In the interFile_modified_tg_jan052017.xlsx file, the same record was mistakenly set to: 11502169 Sdh1p Protein Protein abs 786 791

Hence we got the wrong count of newly normalized proteins (the green ones).

The new interFile_modified_by_Tanya.tsv file has been updated with the correct value.
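A count check along the following lines would flag a record like the one above, where an existing UniProt ID is overwritten with the placeholder. It is a minimal sketch, not the project's code: the function names are made up, the rows are assumed to be already split into columns in the order pmid, entity text, entity type, NormalizedID, and `num_greens` (the number of newly normalized proteins) is assumed to be supplied by hand.

```python
def count_normalized(rows):
    """Count protein rows whose NormalizedID is not the placeholder.

    Assumed row layout: [pmid, entity text, entity type, NormalizedID, ...]
    """
    return sum(
        1
        for row in rows
        if len(row) >= 4 and row[2] == "Protein" and row[3] != "Protein"
    )


def normalization_count_matches(previous_rows, new_rows, num_greens):
    """True if the new file has exactly num_greens more normalized
    proteins than the previous one."""
    expected = count_normalized(previous_rows) + num_greens
    return count_normalized(new_rows) == expected
```

If a previously normalized row loses its ID, the new count falls one short of the expected value and the check returns `False`.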

shpendm commented 7 years ago

We also created a helper file 'helper_for_tsv_and_xlsx.py', for the cases when we have to deal with:

  1. Conversion of xlsx to tsv file
  2. Creation of a new tsv file that contains only normalized rows
  3. Finding of differences between two tsv files and occurrences of non-normalized rows in those files

The file is placed in: resources/corpora/Loctext/

This might be helpful for future changes/additions of input files.
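To give an idea of the third point, a row-level difference between two tsv files can be computed with plain sets. This is an illustrative sketch only and is not taken from helper_for_tsv_and_xlsx.py; the function name is hypothetical, and rows are compared as whole lines.

```python
def diff_tsv_rows(old_lines, new_lines):
    """Return (only_in_old, only_in_new) as sorted lists of rows.

    Rows are compared as whole lines with the trailing newline removed;
    blank lines are ignored. Duplicate rows are collapsed by the sets.
    """
    old = {line.rstrip("\n") for line in old_lines if line.strip()}
    new = {line.rstrip("\n") for line in new_lines if line.strip()}
    return sorted(old - new), sorted(new - old)
```

Applied to the Sdh1p record mentioned above, the original row would show up in the first list and the mistakenly changed row in the second.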

MadhukarSP commented 7 years ago

The new test, which contains the following assertion, is in the file test_num_of_normalizations.py.

"Assert that the number of normalizations in the new corpus is equal to the number of normalizations of the previous one + the number of greens — put that in the test file"