TextpressoDevelopers / textpressocentral

Textpressocentral frontend web application

Combine new "Variation (C. elegans)" category words to "genetic perturbation" words to create new "genetic perturbation (C. elegans)" category #3

Open chris-grove opened 5 years ago

chris-grove commented 5 years ago

I'm not clear on the difference between the "Variation (C. elegans)" category:

Variation (C. elegans) (tpvace:0000001)

and the "allele (C. elegans)" category:

allele (C. elegans) (tpalce:0000001)

but I'd like to add the words from one of those categories to the words from the "genetic perturbation" category:

genetic perturbation (tpgp:0000001)

to create a new "genetic perturbation (C. elegans)" category.

textpresso commented 5 years ago

GitHub didn't allow the attachments (too big), so I sent them to you separately, Chris.

M

On 2/15/19 11:20 AM, Hans-Michael Muller wrote:

These are legacy issues. We should eliminate one of them. I attached both for you to inspect. I suspect you would want the bigger one. Let me know if I should delete the older one.

Michael.

chris-grove commented 5 years ago

@textpresso OK, I was perusing them on TPC Browse, and they seem similar but they're each > 5,000 terms. It would be good to get some input from whoever created these categories? Maybe @kyook ?

textpresso commented 5 years ago

I think I pulled them out of one of Juancarlos' postgres tables at different points in time. We can completely revamp them if you like.

Michael.

chris-grove commented 5 years ago

OK, so the category:

allele (C. elegans) (tpalce:0000001)

appears to be (with some exceptions) an overly redundant version of

Variation (C. elegans) (tpvace:0000001)

in which almost every term from the Variation category that is lowercase is repeated as a capitalized version in the 'allele' category. Capitalized versions of these are, I think, unnecessary as they are never (as far as I am aware) used in the capitalized form. That said, there are some distinct words in each list, regardless of capitalization. I think it would be best to throw out one ('allele' I suppose) and get a completely updated list of allele names to populate the remaining list...

(did some digging)

OK, just emailed you Michael, with a new list.
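The comparison described above (capitalized repeats versus genuinely distinct terms) can be sketched roughly as follows. This is only an illustration: the function name and the sample terms are hypothetical, and the real category lists would need to be loaded from wherever they are exported.

```python
def compare_term_lists(variation_terms, allele_terms):
    """Split terms into case-only duplicates and terms unique to each list."""
    variation_exact = set(variation_terms)
    variation_folded = {t.casefold() for t in variation_terms}
    allele_exact = set(allele_terms)
    allele_folded = {t.casefold() for t in allele_terms}

    # Allele terms that differ from a Variation term only by capitalization.
    case_only_duplicates = sorted(
        t for t in allele_exact
        if t not in variation_exact and t.casefold() in variation_folded
    )
    # Terms with no counterpart in the other list, ignoring case.
    distinct_in_allele = sorted(
        t for t in allele_exact if t.casefold() not in variation_folded
    )
    distinct_in_variation = sorted(
        t for t in variation_exact if t.casefold() not in allele_folded
    )
    return case_only_duplicates, distinct_in_allele, distinct_in_variation


# Hypothetical sample terms, not taken from the real categories.
variation = ["e1370", "ok190", "n765"]
allele = ["E1370", "Ok190", "ok190", "tm1234"]

dups, only_allele, only_variation = compare_term_lists(variation, allele)
print(dups)            # capitalized repeats that could be dropped
print(only_allele)     # terms found only in the 'allele' list
print(only_variation)  # terms found only in the 'Variation' list
```

Comparing with `casefold()` rather than exact strings is what separates "same term, different capitalization" from "different term".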

chris-grove commented 3 years ago

@goldturtle FYI, this is the ticket I was referring to on our last call

goldturtle commented 3 years ago

So what was the conclusion? Do you want to have the combined category?

Sorry about being forgetful,

Michael.

goldturtle commented 3 years ago

@chris-grove, I am not so forgetful after all. I remember I had worked on this. It's just that it's only available at the new site. A C. elegans version of it can be accessed at http://textpressocentral.org:3030/tpc

Michael.

chris-grove commented 3 years ago

@goldturtle Oh OK, thanks. Is it going to be pushed to production? I cannot, for example, access the same paper I've been looking at on the live TPC (WBPaper00044285) on your 3030 instance.

textpresso commented 3 years ago

Indeed. An upgrade in the underlying OS caused pdf2text conversion failures which I subsequently fixed. The markup of this instance fell between those two events. I started a new markup on Friday, and I had hoped it would be finished over the weekend, but it didn't. I'll let you know when it's finished.

M.

chris-grove commented 3 years ago

@goldturtle Great thanks Michael!

goldturtle commented 3 years ago

@chris-grove The new site is now available at http://textpressocentral.org:4040. It has three "different" literatures:

C. elegans Main Article and Suppl (27490 papers)
C. elegans (6 papers)
C. elegans Supplementals (20634 papers)

Don't be confused by C. elegans only having six papers. Those somehow refused to be merged and stayed in that literature instead of being in C. elegans Main Article and Suppl.

chris-grove commented 3 years ago

@goldturtle OK thanks! Looking good!

I do now notice that the list of alleles added includes many "dead" alleles, including transgene names, which can result in a significant number of false positives as a "genetic perturbation". I've rerun a new Postgres query:

SELECT DISTINCT obo_name_variation FROM obo_name_variation WHERE obo_name_variation !~ 'WBVar*' AND obo_name_variation.joinkey NOT IN (SELECT joinkey FROM obo_data_variation WHERE obo_data_variation ~ 'Dead');

and put the output list in a text file here:

https://www.dropbox.com/s/7ogpp770pv6vsvl/Postgres_allele_names_in_obo_name_variation_not_dead_Sep_3_2020.txt?dl=0

which should now filter out all dead alleles (including almost all transgene names), leaving the list with 54,424 alleles (7,578 fewer than before) in total. I actually think this should be the list for the category "Variation (C. elegans) (tpvce:0000000)" but you may want to run that by Karen @kyook first.

Can we specifically add these 54,424 alleles to the "genetic perturbation (C. elegans) (tpgp:0000001)" category (instead of the original 62,002 I sent)?
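One small note on the pattern 'WBVar*' in the query above: in POSIX-style regular expressions the `*` applies to the preceding `r`, so the pattern matches "WBVa" followed by zero or more "r" rather than acting as a shell-style wildcard. The sketch below uses Python's `re` module as a stand-in for the Postgres `~` operator (for a pattern this simple the two should behave the same, but that equivalence is an assumption), and the sample names are hypothetical.

```python
import re

# Hypothetical names for illustration; only the WBVar prefix comes from the thread.
names = ["WBVar00012345", "WBVa_example", "ok190", "e1370"]

# 'WBVar*' means "WBVa" followed by zero or more "r", so it also hits "WBVa_example".
loose_hits = [n for n in names if re.search(r"WBVar*", n)]
# 'WBVar' requires the full literal string somewhere in the name.
strict_hits = [n for n in names if re.search(r"WBVar", n)]

print(loose_hits)
print(strict_hits)

# Counts quoted in the comment above: 62,002 original minus 54,424 remaining.
removed = 62002 - 54424
print(removed)
```

In practice the difference probably does not matter for this table, since names starting with "WBVa" but not "WBVar" are unlikely, but plain 'WBVar' states the intent more directly.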

goldturtle commented 3 years ago

@chris-grove Sure, I can update the category. It'll take a few days.

chris-grove commented 3 years ago

@goldturtle Great! Thanks!

goldturtle commented 3 years ago

@chris-grove I am a bit confused about the 'dead' alleles issue. The query

SELECT * FROM obo_data_variation WHERE obo_data_variation ~ 'Dead' AND obo_name_variation !~ 'WBVar'

returns no results, so only WBVar are declared dead in postgres. (WBVar are already excluded from the category and will stay excluded.) However, the more important question is whether you had other dead alleles in mind. And if so, were those dead alleles ever mentioned in papers in the past? If they were, we still need to include them in the category.

chris-grove commented 3 years ago

@goldturtle Sorry for the delayed response:

The query you've entered above doesn't have a join between the "obo_data_variation" and the "obo_name_variation" tables so it won't produce results for that reason. I think the query you were after is:

SELECT * FROM obo_data_variation WHERE obo_data_variation ~ 'Dead' AND joinkey NOT IN (SELECT joinkey FROM obo_name_variation WHERE obo_name_variation ~ 'WBVar');

which produces 7,586 results. Further filtering out transgene names, i.e. names containing "Is" or "Si":

SELECT * FROM obo_data_variation WHERE obo_data_variation ~ 'Dead' AND joinkey NOT IN (SELECT joinkey FROM obo_name_variation WHERE obo_name_variation ~ 'WBVar' OR obo_name_variation ~ 'Is' OR obo_name_variation ~ 'Si');

yields 5,350 results which could have historical names in them. For example, the allele "ok190" was deemed "unrecoverable" and so marked as "Dead" but it can still be found in Textpresso in one paper (WBPaper00004979/PMID:11559701). So I think we could probably keep these ~5,000 allele names. I'll think of the simplest way to generate the final list of alleles and send them your way.
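For anyone reusing the "Is"/"Si" filter: the `~` operator is case-sensitive and unanchored, so it drops any name that contains either substring anywhere, not only names that follow the transgene naming convention. A minimal sketch of that behaviour, using Python's `re` in place of Postgres and hypothetical names apart from "ok190":

```python
import re

# Equivalent to: name ~ 'Is' OR name ~ 'Si' (case-sensitive, match anywhere).
TRANSGENE_PATTERN = re.compile(r"Is|Si")

# Apart from "ok190", these names are made up for illustration.
names = ["mgIs42", "oxSi221", "ok190", "is99", "SiX1"]

excluded = [n for n in names if TRANSGENE_PATTERN.search(n)]
kept = [n for n in names if not TRANSGENE_PATTERN.search(n)]

print(excluded)  # names the filter would drop
print(kept)      # names that survive the filter
```

Lowercase "is99" survives because the match is case-sensitive, while "SiX1" is dropped even though the substring is at the start of the name.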

chris-grove commented 3 years ago

@goldturtle OK here's the list of alleles to add to the "genetic perturbation" category:

https://www.dropbox.com/s/rwc5m2sk9nf6xjl/Alleles_for_Textpresso_category_Sep_15_2020.txt?dl=0

For the record, this is the Postgres query I ran to get the list:

SELECT obo_name_variation FROM obo_name_variation WHERE joinkey NOT IN (SELECT joinkey FROM obo_name_variation WHERE obo_name_variation ~ 'WBVar' OR obo_name_variation ~ 'Is' OR obo_name_variation ~ 'Si') AND joinkey NOT IN (SELECT joinkey FROM obo_data_variation WHERE obo_data_variation ~ 'Dead');
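To make the logic of this query easier to check, here is a rough, self-contained sketch that runs the same two `NOT IN` exclusions against toy tables in SQLite. Several things are assumptions: the two-column table layout, the joinkey values, the 'status: Dead' text format, and all names other than "ok190" are made up, and a user-defined `REGEXP` function stands in for the Postgres `~` operator.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" as regexp(pattern, value).
    # re.search mimics the unanchored, case-sensitive Postgres "~".
    return int(value is not None and re.search(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.executescript("""
    CREATE TABLE obo_name_variation (joinkey TEXT, obo_name_variation TEXT);
    CREATE TABLE obo_data_variation (joinkey TEXT, obo_data_variation TEXT);
""")

# Toy rows; joinkeys and data text are hypothetical.
conn.executemany("INSERT INTO obo_name_variation VALUES (?, ?)", [
    ("k1", "e1370"), ("k2", "ok190"), ("k3", "mgIs42"),
    ("k4", "oxSi221"), ("k5", "WBVar00000005"), ("k6", "tm1234"),
])
conn.executemany("INSERT INTO obo_data_variation VALUES (?, ?)", [
    ("k1", "status: Live"), ("k2", "status: Dead"), ("k3", "status: Live"),
    ("k4", "status: Live"), ("k5", "status: Live"), ("k6", "status: Live"),
])

QUERY = """
SELECT obo_name_variation FROM obo_name_variation
WHERE joinkey NOT IN (
        SELECT joinkey FROM obo_name_variation
        WHERE obo_name_variation REGEXP 'WBVar'
           OR obo_name_variation REGEXP 'Is'
           OR obo_name_variation REGEXP 'Si')
  AND joinkey NOT IN (
        SELECT joinkey FROM obo_data_variation
        WHERE obo_data_variation REGEXP 'Dead')
ORDER BY obo_name_variation
"""

kept_names = [row[0] for row in conn.execute(QUERY)]
print(kept_names)
```

The two tables are linked only through `joinkey`, which is why each exclusion is expressed as a subquery on that column rather than as a direct condition on the other table.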

textpresso commented 3 years ago

Found this variation and it looks a bit suspicious: cewivar00323347

M.

chris-grove commented 3 years ago

@goldturtle What's wrong? This is a valid allele:

https://wormbase.org/species/c_elegans/variation/WBVar01282747#078623b--10

goldturtle commented 3 years ago

@chris-grove Of course it's a valid allele in WormBase; after all, you got it from querying postgres on tazendra. It's just an odd name that follows a nomenclature similar to WBVar..., and it's the only one of its kind in the list. So I thought it was a remnant of a previous (incomplete) database purge. The allele doesn't show up in the literature.

chris-grove commented 3 years ago

@goldturtle I don't think it is problematic to leave it in. Otherwise I think we'll end up in a rabbit hole/black hole of trying to tease out all alleles that look odd.

goldturtle commented 3 years ago

@chris-grove The site is now updated (available @ http://textpressocentral.org:4040)

chris-grove commented 3 years ago

@goldturtle OK looks great, thanks! You can make it live.