EBIvariation / CMAT

ClinVar Mapping and Annotation Toolkit
Apache License 2.0

Manual curation for 2023.12 release #401

Status: Closed (apriltuesday closed this 8 months ago)

apriltuesday commented 9 months ago

Refer to documentation for full description of steps.

Checklist:

apriltuesday commented 9 months ago

@M-casado @tcezard As the submission window is earlier than we expected, this will have to be a quick one... If you don't have time in the next week or so please feel free to ignore this, we will at least have the updated automatic mappings.

If you do have time, the spreadsheet is here.

Because we did not get a chance to do the massive import from July, I tried to mark terms appearing in that list as SKIP so that you can filter them out and focus on other things. Hopefully that makes sense.

M-casado commented 9 months ago

@apriltuesday I am curating the terms and found some things that I wanted to check with you:

These last ones make me wonder if the replacement finding part of the process is working as we expect it to.

M-casado commented 9 months ago

Ready for review:

  • 128 DONE
  • 778 SKIP
  • 47 IMPORT
  • 3 UNSURE
  • 5468 Blank

M-casado commented 9 months ago

Bear in mind it's my first time doing these steps of the curation, so something may not look like it should (especially when comparing some DONE numbers with those of previous rounds).

I followed step by step the rubric, though, and filled a few blanks, but the ones that had a higher ClinVar Freq were already ringing a bell, and I assume they've been there round after round.

apriltuesday commented 9 months ago

Thanks Marcos, I need to check on a few of your comments but here are some quick answers:

The Comments column seems to have an automatically populated formula. I'll have to remove it for the terms I want to leave a comment on.

Yes sorry, I added that as a temporary measure to skip the July terms and add an explanatory comment, please remove it for any Comment and Status cells as needed. It won't be there in the future (hopefully).

We are to add new import terms to the 2023-07-24 Add EFO disease tab, since we didn't do it last time, right?

You shouldn't do anything for either Add EFO disease tab, those are filled by the script after the manual curation.

I followed step by step the rubric

This is the right thing to do. As much as possible, we should find shortfalls in the rubric and fix them in the documentation directly. It's more painful in the short term but will serve us better in the long run IMHO.

apriltuesday commented 9 months ago

I have found suggested replacement mappings like MONDO:0018875||NOT_SPECIFIED|replacement|NOT_CONTAINED that:

  • Instead of having the URL as the first element, have the CURIE
  • Even though the term says NOT_CONTAINED, when I checked the term in EFO, it was indeed there.

These are related. Basically, the format that we get in the replacement term field from the EFO API is inconsistent, so the code is also not able to check EFO containment properly. We should be able to fix the code to handle most of these cases, but there might always be exceptions, as I think that field is manually filled by someone.
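To make the format problem concrete, here is a minimal sketch of how a replacement identifier could be normalised before the containment check. It is not taken from the CMAT code: `to_curie`, `is_contained`, and the in-memory `efo_curies` set are hypothetical names, and the sketch assumes the field arrives as a full URI, a CURIE, or an underscore short form.

```python
import re

# Hypothetical helpers for illustration only; not the CMAT implementation.
# Matches short forms such as "MONDO_0018875" or CURIEs such as "MONDO:0018875".
_ID_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z][A-Za-z0-9]*)[:_](?P<local>[A-Za-z0-9]+)$")


def to_curie(identifier):
    """Return a PREFIX:LOCAL CURIE for a URI, CURIE, or PREFIX_LOCAL string, else None."""
    if not identifier:
        return None
    # For a URI, keep only the last path or fragment segment.
    short_form = re.split(r"[/#]", identifier.strip())[-1]
    match = _ID_PATTERN.match(short_form)
    if match is None:
        # Free text or an unexpected format: leave it for manual curation.
        return None
    return f"{match.group('prefix')}:{match.group('local')}"


def is_contained(identifier, efo_curies):
    """Check containment against a set of CURIEs (assumed to be loaded elsewhere)."""
    curie = to_curie(identifier)
    return curie is not None and curie in efo_curies
```

With this, `MONDO:0018875` and `http://purl.obolibrary.org/obo/MONDO_0018875` would both be checked as the same term, while anything that does not parse returns `None` and can still be flagged for a curator.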

For example I found the term MONDO:0017138 as a replacement (imported from MONDO) of the obsolete MONDO:0007779, but the former was not listed as a replacement in the spreadsheet.

The replacement process specifically looks for this "term replaced by" annotation (example from MONDO_0007903):

[Image: screenshot of the "term replaced by" annotation on MONDO_0007903]

I don't see that annotation in your example which would be why it isn't picked up. We could ask SPOT whether there's another annotation we could be using.
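As a rough illustration of why such a term is missed, the lookup can be sketched as below. This is not the CMAT code: `find_replacement` is a hypothetical function, and the `is_obsolete` and `term_replaced_by` keys are assumptions about how the term record is represented.

```python
def find_replacement(term_record):
    """Return the replacement identifier for an obsolete term, or None.

    term_record is assumed to be a dict describing one ontology term; the
    key names used here are assumptions, not a confirmed API contract.
    """
    if not term_record.get("is_obsolete"):
        # Only obsolete terms need a replacement.
        return None
    replacement = term_record.get("term_replaced_by")
    if isinstance(replacement, str) and replacement.strip():
        return replacement.strip()
    # Obsolete, but no "term replaced by" annotation: nothing to suggest.
    return None
```

An obsolete term that points to its successor only through some other annotation would return `None` here, which would match the behaviour described for MONDO:0007779.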

I see many Notes from the previous round not only on the UNSURE, but also on the Blanks. To avoid losing them, should we copy-paste them manually to this round's Comments?

I guess so, it's really not ideal though... it would really be better if we could focus on curating terms rather than comments, but I think your proposal makes sense for now.

I think I've captured your feedback in #402 (for issues we can address in the short-term, at least), let me know if I've missed something.

M-casado commented 9 months ago

Hi @apriltuesday, thanks for the responses 👍

The replacement process specifically looks for this "term replaced by" annotation (example from MONDO_0007903):

I think this may be a case where the problem was not on our plate. I reported it yesterday here. Most likely their chosen replacement is just not the best fit for purpose.

Just to emphasize: @tcezard and @apriltuesday, please double-check the number of DONE terms. In previous iterations I saw thousands, but now we don't seem to reach that amount even when adding the SKIP terms.

tcezard commented 8 months ago

I've done a quick review of the DONE. I think the IMPORT we posted in July have not made it through to EFO yet. I checked a few of them and did not see them. I added a few IMPORT and was surprised to see many with exact matches from MONDO but they were not reported by ZOOMA. It might be something we need to check before the next manual curation.

apriltuesday commented 8 months ago

I think the IMPORT we posted in July have not made it through to EFO yet.

We didn't actually submit the July import since we wanted to check with SPOT first, so this is expected... I'll add the terms for this round.

I added a few IMPORT and was surprised to see many with exact matches from MONDO but they were not reported by ZOOMA.

Added to #402.

apriltuesday commented 8 months ago

EFO issue: EBISPOT/efo#2109

This includes imports from this round and July, so I'll close both.

FYI, I did the splitting of the sheets into import/new completely manually for now. I think we should look at improving the script though (probably in #391).