PATRIC3 / patric3_website

Legacy PATRIC Website (JBoss Portal Version)
MIT License

If a pseudogene overlaps a CDS, can we delete the CDS? #1822

Open ARWattam opened 6 years ago

ARWattam commented 6 years ago

https://alpha.patricbrc.org/view/Feature/PATRIC.139.22.AYCB01000014.CDS.1316.1486.rev#view_tab=overview

I have run into this a lot recently. A short CDS, but I have to go to the feature page and then I see that there is something that overlaps it completely, and that gene is a pseudogene. It would probably have saved me a couple of hours if those CDSs were removed once they were called.
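
A rough sketch of the check described above, i.e. finding CDSs that sit entirely inside a pseudogene on the same contig. This is only an illustration, not how PATRIC or RAST does it: the `Feature` class, its field names, and the `cds_covered_by_pseudogenes` helper are hypothetical. The CDS coordinates are read off the feature ID in the link above; the pseudogene coordinates are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical minimal feature record (not the PATRIC schema)."""
    feature_id: str
    contig: str
    feature_type: str  # e.g. "CDS" or "pseudogene"
    start: int         # leftmost coordinate, regardless of strand
    end: int           # rightmost coordinate, regardless of strand


def cds_covered_by_pseudogenes(features):
    """Return the CDS features that lie completely within a pseudogene
    on the same contig."""
    pseudogenes = [f for f in features if f.feature_type == "pseudogene"]
    covered = []
    for f in features:
        if f.feature_type != "CDS":
            continue
        if any(
            p.contig == f.contig and p.start <= f.start and f.end <= p.end
            for p in pseudogenes
        ):
            covered.append(f)
    return covered


features = [
    Feature("cds_short", "AYCB01000014", "CDS", 1316, 1486),
    # Made-up pseudogene coordinates, for illustration only.
    Feature("pseudo_1", "AYCB01000014", "pseudogene", 1200, 1900),
    Feature("cds_other", "AYCB01000014", "CDS", 2500, 3100),
]

for f in cds_covered_by_pseudogenes(features):
    print(f.feature_id)  # prints: cds_short
```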

[Screenshot attached: 2017-12-12 at 3:57:51 PM]

JoshuaVSherman commented 6 years ago

@ARWattam I'm assuming this is a backend, data-related request, not something client-side where the user clicks delete?

ARWattam commented 6 years ago

I would assign this to Maulik. He'll know who handles RAST and can send it to them.


mshukla1 commented 5 years ago

All pseudogenes in PATRIC annotations come from the Classic RAST days, i.e. before 2014. Only the first 20k genomes were annotated using Classic RAST.

RAST never annotated pseudogenes. Because we cared about pseudogenes in PATRIC1, we asked the RAST team to call them specifically for PATRIC, as an additional step in Classic RAST. The pseudogenes were added on top of the other predicted genes, which resulted in overlapping genes and pseudogenes being annotated.

Since we switched to RASTtk in PATRIC3, we no longer annotate pseudogenes.

Given that pseudogenes are no longer annotated and that they exist only in a very small subset of genomes, we have two options.

  1. Leave them as they are and not worry about them.

  2. To make it consistent across all genomes, delete all PATRIC-annotated pseudogenes from old genomes.

@ARWattam Let me know your preference.

-Maulik

ARWattam commented 5 years ago

How will the discussion we had in the last Science meeting, led by @olsonanl, affect this? Just wondering. And would leaving them affect the PFams in any way?

olsonanl commented 5 years ago

How should it affect them? We can and should do what we think is the most correct and useful to the biological analyses.

We do get pseudogene calls on (some of the) GenBank imports; if we wish, we can delete them or mark them as pseudo (as opposed to pseudogene; cf. the document Maulik found at NCBI regarding the distinction they make). I think we need to tread carefully.

ARWattam commented 5 years ago

My original opinion was to get rid of them all, but I also was sensitive to the other opinion of letting sleeping dogs lie, and if it is just some of the older genomes and not broadly shared, it might take a lot of effort and not be worth directing resources toward it. But then I started thinking about the Pattyfams, and wondering if these tiny CDSs are creating problems there by generating pseudo protein families. We would certainly see this across Mycobacterium leprae, and maybe the other pathogens undergoing genomic degradation as well.


mshukla1 commented 5 years ago

I am thinking about adding a pseudo or partial tag to CDS features, which can be used to flag all problematic features.

That way, we can still annotate them with function and protein families, etc.

Then, we can provide filters to show or hide them in the protein family sorter, compare region viewer, genome browser, etc., or color them in a different way.

Using the tag, they can be excluded from the construction of signature kmers and de novo protein families, to maintain the quality of those and avoid false propagation.
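
To make the tag-and-filter idea concrete, here is a minimal sketch. It assumes features are plain dicts with a hypothetical `tags` list; the tag names `pseudo` and `partial` come from the proposal, but the `tags` key, `is_flagged`, and `select_features` are placeholders, not existing PATRIC fields or functions.

```python
# Tags that mark a CDS as problematic (names taken from the proposal).
PROBLEM_TAGS = frozenset({"pseudo", "partial"})


def is_flagged(feature):
    """True if the feature carries at least one problem tag."""
    return bool(PROBLEM_TAGS & set(feature.get("tags", ())))


def select_features(features, include_flagged=False):
    """Return all features when include_flagged is True (e.g. a viewer set
    to "show"); otherwise drop flagged ones (e.g. before building signature
    kmers or de novo protein families)."""
    if include_flagged:
        return list(features)
    return [f for f in features if not is_flagged(f)]


features = [
    {"id": "cds_1", "tags": []},
    {"id": "cds_2", "tags": ["pseudo"]},
    {"id": "cds_3", "tags": ["partial"]},
    {"id": "cds_4"},  # no tags key at all: treated as unflagged
]

for_families = select_features(features)                      # cds_1, cds_4
for_browser = select_features(features, include_flagged=True)  # all four
```

Flagged features stay in the data, so they can still carry function and family annotations; only the downstream consumers decide whether to see them.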

In addition, we can use the partial-to-good CDS ratio as a measure of genome quality.
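
One possible way to compute such a ratio, sketched under assumptions: features are dicts with hypothetical `feature_type` and `tags` keys, "partial" means a CDS tagged `partial`, and "good" means a CDS with neither a `pseudo` nor a `partial` tag. The exact definition would still need to be agreed on.

```python
def partial_to_good_ratio(features):
    """Ratio of CDSs tagged 'partial' to CDSs with no problem tag.

    Returns None when there are no good CDSs, since the ratio is
    undefined in that case.
    """
    partial = 0
    good = 0
    for f in features:
        if f.get("feature_type") != "CDS":
            continue
        tags = set(f.get("tags", ()))
        if "partial" in tags:
            partial += 1
        elif "pseudo" not in tags:
            good += 1
    if good == 0:
        return None
    return partial / good


genome = [
    {"feature_type": "CDS", "tags": []},
    {"feature_type": "CDS", "tags": []},
    {"feature_type": "CDS", "tags": ["partial"]},
    {"feature_type": "CDS", "tags": ["pseudo"]},  # neither partial nor good
    {"feature_type": "tRNA", "tags": []},         # ignored: not a CDS
]

print(partial_to_good_ratio(genome))  # prints: 0.5
```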

-M


ARWattam commented 5 years ago

And I have been thinking about it while cleaning bathrooms…enlightening. It could be that when these genes are breaking in a way that gives them two CDSs with good start and stop codons, those might indeed have function. We really don’t know, and perhaps won’t know until proteomics becomes as cheap as genomics. So I like Maulik’s proposal…leaving the door open, somewhat, to the possibility of importance.
