SionBayliss / PIRATE

A toolbox for pangenome analysis and threshold evaluation.
GNU General Public License v3.0
91 stars, 29 forks

Missing genome in output #70

Closed iferres closed 3 years ago

iferres commented 3 years ago

Hi, I have been trying PIRATE on simulated GFFs, and one genome is missing from the final .tsv files even though it gets through the initial steps of the workflow.

I attach here the PIRATE.gene_families.ordered.tsv output file (I changed the extension to "csv" to comply with GitHub's accepted formats, but it is the tsv): PIRATE.gene_families.ordered.csv (https://github.com/SionBayliss/PIRATE/files/6704171/PIRATE.gene_families.ordered.csv). See the empty column for genome1. I can provide the simulated GFF if you want to test it, or any log file.
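For anyone checking their own output for the same symptom, here is a minimal sketch that reports which genome columns in a tab-separated table are completely empty. The column names and the tiny example table are made up for illustration; real PIRATE output has more metadata columns, so the genome column names are passed in explicitly rather than guessed.

```python
import csv
import io


def empty_genome_columns(tsv_text, genome_columns):
    """Return the requested columns that have no non-empty cell in any row.

    A column that is absent from the header is also reported as empty.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Assume every requested column is empty, then discard a column as soon
    # as one row holds a value in it.
    still_empty = set(genome_columns)
    for row in reader:
        for name in list(still_empty):
            if (row.get(name) or "").strip():
                still_empty.discard(name)
    return [name for name in genome_columns if name in still_empty]


# Hypothetical two-genome table: genome1 has no entries at all.
example = (
    "gene_family\tgenome1\tgenome2\n"
    "g0001\t\tgenome2_00001\n"
    "g0002\t\tgenome2_00002\n"
)
print(empty_genome_columns(example, ["genome1", "genome2"]))
```

To use it on a real file, read the file contents into a string and pass the genome names taken from the header line.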

Also, I would like to know what the "(1)" suffix attached to some gene names means.

SionBayliss commented 3 years ago

Hi Ignacio,

The _1 (or _integer) after a gene name means that the gene family is multicopy and has been split into multiple gene families by PIRATE. Sometimes this is unsuitable for your analysis, and it can be switched off.

When PIRATE runs it initially tells you how many gffs have passed QC. Did your genome get to this stage?

All the best, Sion



iferres commented 3 years ago

Thanks Sion,

From what I can see, genome1.gff makes it into the modified_gffs directory, and its structure looks fine, the same as the rest of the genomes.

SionBayliss commented 3 years ago

Hi Ignacio,

Do you want to send me ~3 genomes so I can do a test run?

All the best, Sion



iferres commented 3 years ago

Here is the dataset where it fails; there are 10 simulated genomes. The GFFs look a little odd because I needed to simulate one gene per contig, but PIRATE seems to cope with them except for a few.

simulated_gffs.tar.gz (https://github.com/SionBayliss/PIRATE/files/6709168/simulated_gffs.tar.gz)

SionBayliss commented 3 years ago

Hi Ignacio,

It looks like PIRATE is filtering out all of the genes in genome1.

It is doing this during the feature extraction step. To be kept, a CDS must have a nucleotide length divisible by 3, contain <5% Ns, have consensus start and stop codons, and be >120 bp long. It might be worth checking whether that holds for the simulated genes in genome1.
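A quick way to test sequences against these four criteria outside of PIRATE is sketched below. This is not PIRATE's code: the start and stop codon sets are an assumption (the usual bacterial set), since the thread does not list the script's defaults, and the length check follows the ">120 bp" wording above rather than the script itself.

```python
# Assumed codon sets (bacterial translation table 11); PIRATE's actual
# defaults are not shown in this thread.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def filter_failures(seq, min_length=120, max_n_fraction=0.05):
    """Return the reasons a CDS would fail the criteria (empty list = pass)."""
    seq = seq.upper()
    length = len(seq)
    reasons = []
    # ">120 bp long" as described above; the script's exact boundary is assumed.
    if length <= min_length:
        reasons.append("too short")
    if length % 3 != 0:
        reasons.append("length not divisible by 3")
    if seq[:3] not in START_CODONS:
        reasons.append("no consensus start codon")
    if seq[-3:] not in STOP_CODONS:
        reasons.append("no consensus stop codon")
    # Fail when the proportion of N sites exceeds the threshold.
    if length and seq.count("N") / length > max_n_fraction:
        reasons.append("too many Ns")
    return reasons


good = "ATG" + "GCT" * 40 + "TAA"   # 126 bp, in frame, valid start and stop
truncated = "ATG" + "GCT" * 40 + "TA"  # 125 bp, out of frame, ends in TTA
print(filter_failures(good))       # []
print(filter_failures(truncated))  # frame and stop-codon failures
```

Running this over every CDS of a genome and tallying the reasons shows which criterion is removing the genes.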

All the best, Sion



iferres commented 3 years ago

OK, yes. Apparently genome1 doesn't pass the filter. I checked, and it's the only genome in which none of the genes pass all of the criteria. Is there any way to relax these filters? Many genes from the other genomes may be getting filtered out as well. Thanks for your time. Closing this.

SionBayliss commented 3 years ago

Hi Ignacio,

You can comment out or edit the relevant lines in extract_feature_sequences.pl:

Gene length:

my $length_threshold = 120;

Triplet reading frame (relaxing this won't make much sense if you are treating the feature as a CDS, and will likely throw errors later in the pipeline):

# must be divisible by 3
if( ($l % 3) != 0 ){
    $include = 0; # <- this line
}

Consensus stop and start codons. You can also change the default expected codons earlier in the script:

# have consensus stop codon.
if ( ! $stop_codons{substr($seq, -3)} ){
    $include = 0; # <- this line
}

# have consensus start codon.
if ( ! $start_codons{substr($seq, 0, 3)} ){
    $include = 0; # <- this line
}

<5% Ns:

# have <5% Ns
if( ($ns/$l) > "0.05" ){ # <- change this value to alter the permitted proportion of N sites
    $include = 0; # <- this line
}

Hope that helps, Sion



iferres commented 3 years ago

Thanks a lot Sion. Bests, Ignacio

iferres commented 3 years ago

Hi Sion, hope you are doing well.

I removed the filters as you suggested, but genome1.gff is still being filtered out. Here's the extract_feature_sequences.pl script in my fork of PIRATE: https://github.com/iferres/PIRATE/blob/master/scripts/extract_feature_sequences.pl

I tested the build with a Singularity container; I could provide the definition file if you are familiar with Singularity.

I dug into the code but couldn't find anything, though I'm not a Perl expert. I'd appreciate any other suggestions. Bests!

SionBayliss commented 3 years ago

Hi Ignacio,

Sorry for not getting back to you sooner; this has been an insane few months for me. This method worked for me on the files you supplied. Are you sure that PIRATE is calling the file that you modified? Perhaps clone PIRATE, comment out the appropriate sections, and then call that version of PIRATE using its full path to see if that works.
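One way to check which copy is actually being picked up is sketched below: it resolves the executable found on PATH (following symlinks) and looks for the helper script in a few nearby locations. The executable name `PIRATE` and the bin/ plus scripts/ layout are assumptions based on the repository URL above, so adjust them to your install.

```python
import shutil
from pathlib import Path


def locate_script(executable="PIRATE", script="extract_feature_sequences.pl", path=None):
    """Return (resolved executable path, list of nearby copies of `script`).

    Returns (None, []) when the executable is not found on the search path.
    """
    found = shutil.which(executable, path=path)
    if found is None:
        return None, []
    resolved = Path(found).resolve()  # follow symlinks to the real install
    here = resolved.parent
    # Assumed layouts: script next to the executable, in a scripts/
    # subdirectory, or in a scripts/ directory beside bin/.
    candidates = [
        here / script,
        here / "scripts" / script,
        here.parent / "scripts" / script,
    ]
    return resolved, [c for c in candidates if c.is_file()]


exe, hits = locate_script()
print("executable:", exe)
for hit in hits:
    print("script:", hit)
```

If the printed script path is not the file that was edited (for example, it points inside a container image or a conda environment), the modified filters are not the ones being run.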

Again, sorry for the long time to reply.

All the best, Sion

iferres commented 3 years ago

Sion, just to let you know that I read your last message, and sorry for not answering sooner. It's fine by me to close this issue, since I don't know when I will be able to address it. I will come back if I get to it in the coming months. Bests!

SionBayliss commented 3 years ago

Hi Ignacio,

No problem. If you get back to it please let me know and I will try and help however I can.

All the best, Sion

