ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0
158 stars 49 forks source link

CAT consensus_gene_set output #256

Open francicco opened 3 years ago

francicco commented 3 years ago

Hi,

I'm looking at the and product of CAT, run only using TransMap and excluding the indel classification r.extend(find_indels(tx, psl, aln_mode)).

I'm now trying to build a table of orthologs and paralogs. The gp_info files should contain all info related to it. I didn't find a detailed description of it. It generally seems to be quite intuitive, although I'm a bit confused. The 4th field transcript_class should be the field important to what I'm doing, because it generally contains ortholog, poor_alignment or possible_paralog, but sometimes it contains gene name instead ( eg Eisa.Eisa1Z00G52.1.cg14688).

I'm confused, maybe something went wrong?

The other thing is related to the indel classification. Removing that step produces some bias I should be aware of?

Thanks a lot. F

francicco commented 3 years ago

Hi guys,

About the previous point, I realised that the problem rely on the column order. It differs in the different gp_info files

Now, I’m trying to go through the final cat output files (gp_info). What I’d like to get is the orthology inference. Most of it is contained in the transcript_class (ortholog). From what I understood when transcript_class contains “ortholog” all gene_ids are 1:1 ortholog to the source_gene id. What about possible paralogies? The paralogy field seems to be always empty, but not the unfiltered_paralogy which contains some ids from the reference annotation plus some code (eg: 1-1).

Can you please help me decipher this? Also, some of the sequences only contains short stretches of X. Maybe you want to fix this in future releases of CAT.

Thanks a lot for your support Best, F

Hmel.gp_info.gz

mhaukness-ucsc commented 3 years ago

I'll try to answer what I can here...

The ids in the unfiltered_paralogy field are possible paralogies and are all the other transcripts in the reference annotation that mapped to the target. The paralogyfield contains which of those paralogies remain after filtering, which is done based on the global-near-best parameter (by default 0.15, meaning only alignments that have a score within 15% of the highest-scoring alignment are kept). So it sounds like all of your possible paralogies got filtered out... depending on the goal of your analysis you can try adjusting global-near-best to be a higher value, or you can use all of the unfiltered alignments.

Which sequences are you referring to when you say some contain only short stretches of X?

francicco commented 3 years ago

Hi @mhaukness-ucsc,

Thanks for the answer. How can I adjust it? Can I rerun it on a finished run?

Which sequences are you referring to when you say some contain only short stretches of X?

Some of the consensus protein sequences are just "X"

Thanks F

diekhans commented 3 years ago

@francicco is is possible to make the assembly hub available through HTTP? It would be much easier if we could look at the annotations in the context of the genome to answer questions like the one about the "X"s. All that is needed is a public HTTP site for the generated assembly hub. No software needs to run.

francicco commented 3 years ago

I can't generate a http link... :/ F

francicco commented 3 years ago

Which file do you want specifically? F

mhaukness-ucsc commented 3 years ago

Unfortunately there isn't a way to adjust the parameter without rerunning the pipeline and changing it in the command line, which may or may not be worthwhile (depending on how important having only "better" paralogies included in your set is to you)

francicco commented 3 years ago

It's kind of important to have paralogy, because I'd like to identify 1:1 orthology in order to run the phylogeny and other analysis such as positive selection. F

francicco commented 3 years ago

In case, do I have to reduce it or increase it? F

diekhans commented 3 years ago

I can't generate a http link... :/

That is a bad situation for doing collaborative science as well as exploring your own data. CAT generate a UCSC browser assembly hub which you can then load remotely into the browser.

You will want to use this for analyzing your data and generating images for your paper.

I would highly suggest raise this issue and trying to get proper support. Assembly hubs are also supported in Amazon AWS S3.

diekhans commented 3 years ago

however, cactus doesn't have to be rerun..

mhaukness-ucsc @.***> writes:

Unfortunately there isn't a way to adjust the parameter without rerunning the pipeline and changing it in the command line, which may or may not be worthwhile (depending on how important having only "better" paralogies included in your set is to you)

mhaukness-ucsc commented 3 years ago

I was mostly wondering if the unfiltered_paralogies would be good enough for your purpose. You would want to increase the value of --global-near-best to leave in more paralogous alignments. The exact value is up to you to choose--how far off are you willing to have these alignments be? For example, increasing it all the way to 0.5 would leave in paralogous alignments that are within 50% of the highest scoring alignment (but this may be too much).

The specific assemblyHub file we want to link to is hub.txt in the assemblyHub output directory. For example, I have a hub.txt publicly accessible at this link: http://courtyard.gi.ucsc.edu/~jcarmstr/cat_data/primates_evan/v2/out/assemblyHub/hub.txt  that can be loaded into the browser. All of the other CAT output files are on the same server and it's all accessible by public HTTP

francicco commented 3 years ago

That is a bad situation for doing collaborative science as well as exploring your own data. CAT generate a UCSC browser assembly hub which you can then load remotely into the browser.

You will want to use this for analyzing your data and generating images for your paper.

I would highly suggest raise this issue and trying to get proper support. Assembly hubs are also supported in Amazon AWS S3.

I totally agree. I'm having my fight with the stubborn IT that doesn't undestand anything about science!!! Unfortunately I can't do much, I'm loosing my war.

Anything else I could do? F

francicco commented 3 years ago

Thanks a lot @mhaukness-ucsc I'll try 0.2, just to conferm it's working... F

mhaukness-ucsc commented 3 years ago

There might not be enough difference from 0.15 (which is the default) but it may help. Maybe 0.25?

francicco commented 3 years ago

by the way the cells are empty or with N/A which is weird to me F

diekhans commented 3 years ago

Anything else I could do?

Well, you could complain up your leadership chain on how this makes research very difficult and people should go else where ;-)

While you can run your own browser in a virtual machine internally, that doesn't help sharing the data and creates it own problems.

The browser has a guide here: https://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html#Hosting

The free ones all have drawbacks that probably will not work for you, so you will have to pay.

We have used AWS S3 frequently for assembly hubs. It is a bit slow, but still usable. Personally, I use The Backblaze B2 cloud for backup my home system, since I have like 3 terabytes of photos. I have no idea how well they work with serving an assembly hub, however they are significant less expensive than AWS S3.

francicco commented 3 years ago

Thanks a lot Mark, I'm trying to use MEGA and/or Backblaze to share the data with you. Meantime, I think we share previous data with you. Would you try logging in here

Another question related to this problem is the following:

I want to study ortholog group (OG) expansion/contraction. I want to make a table where: in the first column I have the id of the OG and in the following cells how many genes a species have (following columns). I can easily use the source_gene field for the OG is and as a value the sum of the gene_id + (unfiltered_paralogy or paralogy). The problem is, how do I get paralogous for the reference source_gene? Is there a way?

One way would be to exclude the reference species from the list, another would be to use an outgroup taxon and remove that one from the list...

Thanks for your help F

diekhans commented 3 years ago

yes, I can still log in. However, that doesn't give you an assembly hub. Once you have one, you will find it indispensable.

I am not sure what to suggest in terms of paralogs; Ensembl gene trees might be of help. CAT doesn't have this information.

Francesco Cicconardi @.***> writes:

Thanks a lot Mark, I'm trying to use MEGA and/or Backblaze to share the data with you. Meantime, I think we share previous data with you. Would you try logging in here

Another question related to this problem is the following:

I want to study ortholog group (OG) expansion/contraction. I want to make a table where: in the first column I have the id of the OG and in the following cells how many genes a species have (following columns). I can easily use the source_gene field for the OG is and as a value the sum of the gene_id + (unfiltered_paralogy or paralogy). The problem is, how do I get paralogous for the reference source_gene? Is there a way?

One way would be to exclude the reference species from the list, another would be to use an outgroup taxon and remove that one from the list...

Thanks for your help

francicco commented 3 years ago

I also sent you the link to MEGA where I uploaded all the files. Did you see it?

I also uploaded the data to Backblaze: https://f000.backblazeb2.com/file/CAThub/hub.txt

Cheers F

diekhans commented 3 years ago

did you try out the hub in the browser?

Francesco Cicconardi @.***> writes:

I also sent you the link to MEGA where I uploaded all the files. Did you see it?

I also uploaded the data to Backblaze: https://cathub.s3.us-west-000.backblazeb2.com/hub.txt

francicco commented 3 years ago

It works. I see the different tracks, but I don't see how that would really help me. Maybe I'm missing something F

diekhans commented 3 years ago

So send a couple of the gene ids were the proteins are Xs and we can take a look.

Francesco Cicconardi @.***> writes:

It works. I see the different tracks, but I don't see how that would really help me. Maybe I'm missing something F

francicco commented 3 years ago

Avpe_T0006658 for example:

grep -A1 Avpe_T0006658 Avpe.protein.consensus.fasta 
>Avpe_T0006658
XXXXXXXXXXXXXXXXXXXXXX
grep -A1 Avpe_T0006658 Avpe.consensus.fasta
>Avpe_T0006658
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTGCTGGTCTGCACACGCTGAGC

it's annotation for the Avpe.gff3:

Scf0000033  CAT transcript  22145   27014   8480    -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;frameshift=nan;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_G0005339;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=Avpe_T0006658;Name=Eisa.Eisa1600G444
Scf0000033  CAT exon    22145   23384   .   -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=exon:Avpe_T0006658:0;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT intron  23385   23788   .   -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=intron:Avpe_T0006658:0;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT exon    23789   23878   .   -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=exon:Avpe_T0006658:1;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT CDS 23839   23878   .   -   1   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=CDS:Avpe_T0006658:0;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT intron  23879   25352   .   -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=intron:Avpe_T0006658:1;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT exon    25353   25380   .   -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=exon:Avpe_T0006658:2;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT CDS 25353   25380   .   -   2   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=CDS:Avpe_T0006658:1;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT intron  25381   27008   .   -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=intron:Avpe_T0006658:2;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Scf0000033  CAT exon    27009   27014   .   -   .   source_transcript=Eisa.Eisa1600G444.1;source_transcript_name=Eisa.Eisa1600G444.1.rab2a;source_gene=Eisa.Eisa1600G444;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa1600G444.1-0;exon_annotation_support=0,0,0,0;intron_annotation_support=22,0,0;transcript_class=ortholog;valid_start=False;valid_stop=False;adj_start=23838;adj_stop=25380;proper_orf=False;paralogy=nan;unfiltered_paralogy=nan;source_gene_common_name=Eisa.Eisa1600G444;transcript_id=Avpe_T0006658;gene_id=Avpe_G0005339;Parent=Avpe_T0006658;transcript_name=Eisa.Eisa1600G444.1.rab2a;ID=exon:Avpe_T0006658:3;Name=Eisa.Eisa1600G444;rna_support=N/A;reference_support=True
Screenshot 2021-04-29 at 17 09 00
diekhans commented 3 years ago

I don't see the MEGA link in the ticket, but backblaze works great.

Oh, so cool to actually see your data. Pretty much everything you need to understand CAT annotations is here.

In this cases, the CDS mapped to small gaps (runs of Ns, so this gets translated to XXX

Francesco Cicconardi @.***> writes:

Avpe_T0006658 for example:


grep -A1 Avpe_T0006658 Avpe.protein.consensus.fasta 
>Avpe_T0006658
XXXXXXXXXXXXXXXXXXXXXX
francicco commented 3 years ago

I don't see the MEGA link in the ticket, but backblaze works great. I sent it by email

In this case, the CDS mapped to small gaps (runs of Ns, so this gets translated to XXX That was clear, I guess one wants to CAT to filter these loci out, they are not very useful

F

diekhans commented 3 years ago

NNs was going to be me guess, but it is good to confirm it.

While it would be nice to have CAT have more downstream filtering options, this is ultimately something that the user needs to decide. This might now be useful to you, but they might be useful for other users. It actually appears to be a good mapping.

Most users apply downstream filtering before they do their analysis. We are doing this for another assembly right now.

Francesco Cicconardi @.***> writes:

In this case, the CDS mapped to small gaps (runs of Ns, so this gets translated to XXX That was clear, I guess one wants to CAT to filter these loci out, they are not very useful

francicco commented 3 years ago

I'm actually fine with that, I'm more worried about the missing paralogy. F

diekhans commented 3 years ago

CAT paralog resolution needs improvement; it has more been designed around finding the correct ortholog.

Doing alignments, such as BLAT protein-translated alignments, are a way to find paralogs that CAT doesn't find.

We recently did an unannotated reference paralog analysis by using liftOff to do a self mapping to the reference.

Francesco Cicconardi @.***> writes:

I'm actually fine with that, I'm more worried about the missing paralogy.

francicco commented 3 years ago

I just would like to understand what this info means:

gene_id transcript_id   source_gene transcript_class    paralogy    unfiltered_paralogy
Hmel_G0000023   Hmel_T0000030   Eisa.Eisa3100G21    ortholog    <empty> Eisa.Eisa3100G21.1-0
Hmel_G0000024   Hmel_T0000031   Eisa.Eisa0900G280   ortholog    <empty> Eisa.Eisa0900G280.1-1,Eisa.Eisa0900G280.1-2
Hmel_G0000025   Hmel_T0000032   Eisa.Eisa2700G473   ortholog    <empty> Eisa.Eisa2700G473.1-0,Eisa.Eisa2700G473.1-1
Hmel_G0000026   Hmel_T0000033   Eisa.Eisa00006G6    ortholog    <empty> Eisa.Eisa00006G6.1-0,Eisa.Eisa00006G6.1-1

And if I can use it. Thanks F

francicco commented 3 years ago

Also I see that sometimes different gene_ids are the same on the same genomic locus.

Screenshot 2021-04-30 at 10 03 21 AM

Which is really something I don't want. So, at this point, what field in the table that define the locus, nor gene_id transcript_id do.

I'm very confused. F

diekhans commented 3 years ago

In the future, please add the coordinates you are looking at in text in the ticket. This saves typing from reading the image.

Francesco Cicconardi @.***> writes:

Also I see that sometimes different gene_ids are the same on the same genomic locus. Which is really something I don't want. So, at this point, what field in the table that define the locus, nor gene_id transcript_id do.

This maybe a case where one of the gene family members is deleted and CAT doesn't resolve the correct orthology.

CAT doesn't have a concept of locus id; you could easily pick one of the overlapping genes based on some easy criteria, like one that sorts lowest, or you can try to decide which is the more likely ortholog.

francicco commented 3 years ago

Sorry for not showing the coords.

Anyways I think this is really a problem. I don't want to sound cocky, but If CAT shows gene_id G1 and transcript_id T1 and T2 (G1.T1 and G1.T2), to me both transcripts belong to the same gene, and if the gene_ids are different, to me those transcripts belong to different genes.

Again, I can fix this somehow with a downstream analysis, but I truly believe this is a real limitation if not a mistake. Especially if someone like me is not working on very accurate species model annotation like human or drosophila. You should at least say to users that gene_id don't mean much.

I still don't fully understand the table here:

gene_id transcript_id   source_gene transcript_class    paralogy    unfiltered_paralogy
Hmel_G0000023   Hmel_T0000030   Eisa.Eisa3100G21    ortholog    <empty> Eisa.Eisa3100G21.1-0
Hmel_G0000024   Hmel_T0000031   Eisa.Eisa0900G280   ortholog    <empty> Eisa.Eisa0900G280.1-1,Eisa.Eisa0900G280.1-2
Hmel_G0000025   Hmel_T0000032   Eisa.Eisa2700G473   ortholog    <empty> Eisa.Eisa2700G473.1-0,Eisa.Eisa2700G473.1-1
Hmel_G0000026   Hmel_T0000033   Eisa.Eisa00006G6    ortholog    <empty> Eisa.Eisa00006G6.1-0,Eisa.Eisa00006G6.1-1

Would you please explain it to me. It isn't very clear to me.

Thanks a lot, and sorry for all these issues, I'm really sorry F

mhaukness-ucsc commented 3 years ago

Your understanding of the gene IDs and transcript IDs is correct. However, the problem here is likely what Mark said:

This maybe a case where one of the gene family members is deleted and CAT doesn't resolve the correct orthology.

Only one of these genes is actually there in the genome, but they weren't different enough for CAT to pick one over the other. If you don't want to do the downstream filtering to remove these sorts of situations, one thing you can do is rebuild the consensus of CAT (which doesn't require rerunning the whole pipeline! you just need to add the --rebuild-consensus flag to your command line) and add the --filter-overlapping-genes flag, which should remove one of the extra copies.

In the table, the unfiltered paralogy column basically lists all of the other alignments that also mapped to this same spot. They could potentially be paralogs of the transcript in question, but it seems as though none of them are close enough to pass through the filter to get into the paralogy column. I would be curious to see what the corresponding rows of the table look like for the genes you included in the screenshot. I'm not sure this answered any of your questions but let me know if I can elaborate further!

francicco commented 3 years ago

Hi @mhaukness-ucsc,

Thanks a lot for your reply. I'm rerunning the the finished job with those flags, lets see that I get. About the paralogy, these are the fields for that locus

Hmel_G0002292   Hmel_T0002716   Eisa.Eisa2100G629       ortholog                
Hmel_G0002293   Hmel_T0002717   Eisa.Eisa2100G633       ortholog                Eisa.Eisa2100G633.1-1,Eisa.Eisa2100G633.1-2

I don't know if it's what you were expecting, but what confuses me is the paralogy field is always empty or N/A and the unfiletered_paralogy is: empty, N/A, or when it isn't empty, the ids are always the same in source_gene field. My confusion arises because I was expecting a different gene id in the unfiletered_paralogy field, which I then can match with the source_gene field and find the related gene_id.

I need all this information to build the paralogy table. If CAT is not really designed for that, I'd just take the annotation out (gene and transcript) and run a separate orthology search, such as (Orthofinder or similar). I was just trying to find a way to do both things (annotation and paralaogy) in one go with CAT, because I think synteny is a much more reliable method.

Thanks a lot! F

francicco commented 3 years ago

Ok, it seems like --filter-overlapping-genes has an effect now.

Screenshot 2021-05-01 at 4 29 40 PM

Hmel_G0002159 Hmel_T0002552 Eisa.Eisa2100G629 ortholog

Hmel202001o CAT gene    8325863 8332135 .   +   .   source_gene_common_name=Eisa.Eisa2100G629;source_gene=Eisa.Eisa2100G629;gene_biotype=protein_coding;gene_id=Hmel_G0002159;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;gene_name=Eisa.Eisa2100G629;transcript_modes=transMap;ID=Hmel_G0002159;Name=Eisa.Eisa2100G629
Hmel202001o CAT transcript  8325863 8332135 8990    +   .   source_transcript=Eisa.Eisa2100G629.1;source_transcript_name=Eisa.Eisa2100G629.1.cg12025;source_gene=Eisa.Eisa2100G629;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa2100G629.1-0;frameshift=nan;exon_annotation_support=1,1;intron_annotation_support=49;transcript_class=ortholog;valid_start=True;valid_stop=False;adj_start=8325951;adj_stop=8331539;proper_orf=True;paralogy=nan;unfiltered_paralogy=nan;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;source_gene_common_name=Eisa.Eisa2100G629;transcript_id=Hmel_T0002552;gene_id=Hmel_G0002159;Parent=Hmel_G0002159;transcript_name=Eisa.Eisa2100G629.1.cg12025;ID=Hmel_T0002552;Name=Eisa.Eisa2100G629
Hmel202001o CAT exon    8325863 8326103 .   +   .   source_transcript=Eisa.Eisa2100G629.1;source_transcript_name=Eisa.Eisa2100G629.1.cg12025;source_gene=Eisa.Eisa2100G629;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa2100G629.1-0;exon_annotation_support=1,1;intron_annotation_support=49;transcript_class=ortholog;valid_start=True;valid_stop=False;adj_start=8325951;adj_stop=8331539;proper_orf=True;paralogy=nan;unfiltered_paralogy=nan;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;source_gene_common_name=Eisa.Eisa2100G629;transcript_id=Hmel_T0002552;gene_id=Hmel_G0002159;Parent=Hmel_T0002552;transcript_name=Eisa.Eisa2100G629.1.cg12025;ID=exon:Hmel_T0002552:0;Name=Eisa.Eisa2100G629;rna_support=N/A;reference_support=True
Hmel202001o CAT CDS 8325952 8326103 .   +   0   source_transcript=Eisa.Eisa2100G629.1;source_transcript_name=Eisa.Eisa2100G629.1.cg12025;source_gene=Eisa.Eisa2100G629;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa2100G629.1-0;exon_annotation_support=1,1;intron_annotation_support=49;transcript_class=ortholog;valid_start=True;valid_stop=False;adj_start=8325951;adj_stop=8331539;proper_orf=True;paralogy=nan;unfiltered_paralogy=nan;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;source_gene_common_name=Eisa.Eisa2100G629;transcript_id=Hmel_T0002552;gene_id=Hmel_G0002159;Parent=Hmel_T0002552;transcript_name=Eisa.Eisa2100G629.1.cg12025;ID=CDS:Hmel_T0002552:0;Name=Eisa.Eisa2100G629;rna_support=N/A;reference_support=True
Hmel202001o CAT start_codon 8325952 8325954 .   +   0   source_transcript=Eisa.Eisa2100G629.1;source_transcript_name=Eisa.Eisa2100G629.1.cg12025;source_gene=Eisa.Eisa2100G629;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa2100G629.1-0;exon_annotation_support=1,1;intron_annotation_support=49;transcript_class=ortholog;valid_start=True;valid_stop=False;adj_start=8325951;adj_stop=8331539;proper_orf=True;paralogy=nan;unfiltered_paralogy=nan;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;source_gene_common_name=Eisa.Eisa2100G629;transcript_id=Hmel_T0002552;gene_id=Hmel_G0002159;Parent=Hmel_T0002552;transcript_name=Eisa.Eisa2100G629.1.cg12025;ID=start_codon:Hmel_T0002552;Name=Eisa.Eisa2100G629;rna_support=N/A;reference_support=True
Hmel202001o CAT intron  8326104 8331199 .   +   .   source_transcript=Eisa.Eisa2100G629.1;source_transcript_name=Eisa.Eisa2100G629.1.cg12025;source_gene=Eisa.Eisa2100G629;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa2100G629.1-0;exon_annotation_support=1,1;intron_annotation_support=49;transcript_class=ortholog;valid_start=True;valid_stop=False;adj_start=8325951;adj_stop=8331539;proper_orf=True;paralogy=nan;unfiltered_paralogy=nan;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;source_gene_common_name=Eisa.Eisa2100G629;transcript_id=Hmel_T0002552;gene_id=Hmel_G0002159;Parent=Hmel_T0002552;transcript_name=Eisa.Eisa2100G629.1.cg12025;ID=intron:Hmel_T0002552:0;Name=Eisa.Eisa2100G629;rna_support=N/A;reference_support=True
Hmel202001o CAT exon    8331200 8332135 .   +   .   source_transcript=Eisa.Eisa2100G629.1;source_transcript_name=Eisa.Eisa2100G629.1.cg12025;source_gene=Eisa.Eisa2100G629;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa2100G629.1-0;exon_annotation_support=1,1;intron_annotation_support=49;transcript_class=ortholog;valid_start=True;valid_stop=False;adj_start=8325951;adj_stop=8331539;proper_orf=True;paralogy=nan;unfiltered_paralogy=nan;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;source_gene_common_name=Eisa.Eisa2100G629;transcript_id=Hmel_T0002552;gene_id=Hmel_G0002159;Parent=Hmel_T0002552;transcript_name=Eisa.Eisa2100G629.1.cg12025;ID=exon:Hmel_T0002552:1;Name=Eisa.Eisa2100G629;rna_support=N/A;reference_support=True
Hmel202001o CAT CDS 8331200 8331539 .   +   1   source_transcript=Eisa.Eisa2100G629.1;source_transcript_name=Eisa.Eisa2100G629.1.cg12025;source_gene=Eisa.Eisa2100G629;transcript_modes=transMap;gene_biotype=protein_coding;transcript_biotype=protein_coding;alignment_id=Eisa.Eisa2100G629.1-0;exon_annotation_support=1,1;intron_annotation_support=49;transcript_class=ortholog;valid_start=True;valid_stop=False;adj_start=8325951;adj_stop=8331539;proper_orf=True;paralogy=nan;unfiltered_paralogy=nan;collapsed_gene_ids=Eisa.Eisa2100G633;collapsed_gene_names=Eisa.Eisa2100G633;source_gene_common_name=Eisa.Eisa2100G629;transcript_id=Hmel_T0002552;gene_id=Hmel_G0002159;Parent=Hmel_T0002552;transcript_name=Eisa.Eisa2100G629.1.cg12025;ID=CDS:Hmel_T0002552:1;Name=Eisa.Eisa2100G629;rna_support=N/A;reference_support=True

Using --global-near-best=0.50 generated to paralogy still, the field is empty.

The last thing I don't understand, but I may be very wrong here, is that I thought also other annotations should contribute to annotate all the genomes. What I mean about that is that if a gene is not present in the reference, but present in one of the other species, this would be used to "fill", propagate to the other genomes. In my dataset, it doesn't seem this is happening, because I only see transcripts in the reference to be listed in source_gene field.

Thanks really a lot!!! Best, F

mhaukness-ucsc commented 3 years ago

I'm not sure why there is still nothing being generated to the paralogy field. I actually noticed the same thing happening in one of my recent CAT annotations as well, with the field only ever being empty or "N/A". There is likely a bug somewhere, but I haven't been able to find it yet. Also, in your table, it looks like the IDs listed in as unfiltered paralogies are alignments of the same reference transcript to other places in the genome (they have the same base ID with different numbers after the hyphen) so these are not informative anyway. I would try out doing a separate orthology search as you proposed to build your table. Sorry CAT isn't working out for this purpose! Paralog resolution is an area of active development.

As far as propagating genes not in the reference, the only mode something like that could happen in is if you ran AugustusCGP, which uses the genome alignment to simultaneously predict de novo transcripts in all genomes. CAT won't go back and use the annotations in one species to fill in the others. So you should only be seeing reference transcripts as source genes.

francicco commented 3 years ago

Ok, Thank you @mhaukness-ucsc,

Now things are more clear. For some reason AugustusCGP is still not working, so what I'm gonna do is run CAT twice: one time using an outgroup as a reference and a second time using an ingroup. In this way, I think I should be able to minimise possible missing annotations in my ingroup ref species.

Thanks a lot!!! F

francicco commented 3 years ago

Hi @mhaukness-ucsc,

I promise one last question! In my dataset there are few fragmented genomes. A problem with fragmentation is having multiple fragments of the same gene mapped multiple times. I thought that using a multiple genome alignment and annotating them using CAT I'd be able to fix some of them. I'm not sure if CAT does it, but in its output there's a field possible_split_gene_locations that maybe could be helpful.

One of the record has:

gene_id transcript_id   source_gene     transcript_class        alignment_id    source_transcript       possible_split_gene_locations   alternative_source_transcripts
Eali_G0000010   Eali_T0000011   Eisa.Eisa0200G59        ortholog        Eisa.Eisa0200G59.1-4    Eisa.Eisa0200G59.1      Scf0056724:497-1209,Scf0032971:12-761   Eisa.Eisa0200G59.2

Scf0000001:176,331-179,828

Screenshot 2021-05-04 at 15 24 13

This locus seems to have 2 possible_split_gene_locations

Screenshot 2021-05-04 at 15 30 19 Screenshot 2021-05-04 at 15 15 50

How should I interpret these results? Thanks F

mhaukness-ucsc commented 3 years ago

What's going on here is that it looks like there is a gene split between different contigs in this particular genome assembly, and yes, CAT can identify those in the possible_split_gene_locations column. We would expect to see more split genes the more fragmented the input assembly is.

Here, the same original transcript from the reference (Eisa.Eisa0200G59.1) looks like it mapped to at least 4 different locations in the genome based on the numbers after the hyphen in the alignment id, (Eisa.Eisa0200G59.1-4, Eisa.Eisa0200G59.1-3, and Eisa.Eisa0200G59.1-2 can all be seen in your screenshots). CAT identifies these second two as potentially being part of a split gene with the first because the parts of the transcript projection on those other contigs have no overlaps with the one the first.

Some other things to look at: in the first screenshot, we can see that the alignment of the transcript to this locus wasn't great because of the double lines on the right half of the transMap track. This means there was unaligned sequence from the reference transcript in this region. Also, the two possible split gene locations are right at the beginnings of their contigs. So to me this looks like an issue with the fragmented input assembly, where one or both of those alternate contigs should be placed where that first screenshot is so that this gene would all be together on one contig.

CAT should resolve these gene fragment situations so that the reference gene ID only appears for one transcript in the final output files. If there are multiple copies of fragments of the gene, they might show up as "paralogy". However, I am curious what the entries for the transcripts on the other contigs (Eali_T0019442 and Eali_T0015853) look like? Do they have the same different source genes than the first, Eali_T0000011?

francicco commented 3 years ago

The source gene for Eali_T0000011 is Eisa.Eisa0200G59 while for both Eali_T0019442 and Eali_T0015853 the values are N/A and possible_paralog, respectively for source_gene and transcript_class.

Do you think it's worth trying to reconstruct the whole gene?

My idea would be to align all the fragments, in this case: Eali_T0000011, Eali_T0019442, and Eali_T0015853 to the source_gene, and generate a consensus excluding the source_gene. I'm sure it won't be a perfect reconstruction, but at least I'd reduce the fake paralogy and increase the length of the gene. I'm guessing that if the fragments overlap too much they could represent putative paralogies more than split genes.

This is how I recostructed this gene:

Screenshot 2021-05-05 at 20 48 26

Do you think it would work?

Also, to maximise the annotation of the missing genes in the reference I'm thinking to iterate CAT using multiple species from the outgroups up to the main and definitive reference species within the ingroup. Do you think is a good idea?

Thank you for replying to my continue issues with CAT.

Best, F

mhaukness-ucsc commented 3 years ago

Thanks for the info on those two transcripts. I was curious how CAT would handle the split genes when there was also possible gene family collapse. I'd guess CAT is saying they might be paralogs of the other gene in the family (which could be happening, or they might just the parts of the same split gene).

As far as reconstructing the gene, I think it depends on what sort of downstream analysis you want to do. The split parts are currently left apart because there isn't a way in these annotation files to indicate that a genomic feature is on two different contigs. To get around this, CAT tries to indicate that the feature may extend onto another contig with the possible_split_gene_locations field. But if you want to use the predicted mRNA or protein sequences for these split genes, then you could reconstruct them in a way like you describe.

I'm a bit confused by what you're describing, what do you want to change in each iteration? What species will you use as the reference, and how will you use the output from the previous runs?

francicco commented 3 years ago

I'm a bit confused by what you're describing, what do you want to change in each iteration? What species will you use as the reference, and how will you use the output from the previous runs?

I'll try to explain myself better. So at the moment, from what I understood TransMap in CAT do not cross-annotated species (meaninig that annotation present in the non-reference species can used to annotate all the other species), nor update the reference species annotation. To try to mimic a cross-annotation and improve the "real reference annotation" I was thinking to run CAT multiple times using different reference species, maybe starting from the most distant related species giving the resulted CAT annotations to the next iteration. So if I want to run 3 iterations from the species A.ite0, B.ite0, C.ite0, I'd run the iteration 1 with C.ite0 as ref, updating the annotation of A.ite0 and B.ite0, now A.ite1, B.ite1; then run the iteration 2 using B.ite1 as ref, generating A.ite2, C.ite1, and finally run iteration 3 using A.ite2 as ref generating B.ite2, C.ite2.

Make sense? Cheers F

mhaukness-ucsc commented 3 years ago

Yes, thanks for the explanation!

Do each of the species have some sort of starting annotation you can use, or is there only the reference annotation that you have currently been using?

To think out loud here, if there is only one starting reference annotation, and you're only using TransMap, I'm not sure how much (if any) information this will add. Say you project something from C.ite0 onto B.ite0 with TransMap, in the next iteration you would project it onto A.ite1, but if that works, isn't it likely that it would have already been there from projecting C.ite0 onto A.ite0 in the first iteration? I think if you were incorporating different information into the annotation pipeline (with transcriptomic data, or by running AugustusPB or AugustusCGP) then it could make a difference. But there needs to be some way of adding in de novo predictions into each iteration for those to make it onto other species or back to the reference. I think you would need to run with a mode in addition to TransMap or add in an already existing external annotation for some of your genomes to get good results. Does that make sense?

francicco commented 3 years ago

So, the reason why I'm only doing TransMap is that I couldn't make to work AugustusPB or AugustusCGP. Several bugs emerged that were never been fixed although bot Ian and Mark tried, they were quite busy. At some stage also TransMap didn't work. Time passed and I needed to finish this stage.

I have 59 (+4 published outgroup species) genomes. From these 59 I annotated (deNovo+adInitio+Augustus) 8 genomes. I then took all the transcripts from these annotations to train augustus (BRAKER) and annotate the remaining genomes, with the intent was to mimic AugustusPB. Now I'd like to iterate CAT with 6 of these annotations in order to generate a more homogeneous annotation across all species.

I'd really wish to use AugustusCGP, maybe on the last iteration if you're not too busy and willing to help me...

Thanks anyway! F

mhaukness-ucsc commented 3 years ago

Ok yes, that makes sense! With the data you have, doing the iterative process you described does sound like a good way to improve these annotations.

I agree that using AugustusCGP at the last stage would be ideal. I can try to help out with that once you get there. Let me know if you need any other help setting up these runs / debugging any errors!

francicco commented 3 years ago

Thanks a lot! I will! F

diekhans commented 3 years ago

Sorry, I don't remember where we hit a wall with AugustusPB / AugustusCGP. Can you point us at the ticket?

Francesco Cicconardi @.***> writes:

So, the reason why I'm only doing TransMap is that I couldn't make to work AugustusPB or AugustusCGP. Several bugs emerged that were never been fixed although bot Ian and Mark tried, they were quite busy. At some stage also TransMap didn't work. Time passed and I needed to finish this stage.

I have 59 (+4 published outgroup species) genomes. From these 59 I annotated (deNovo+adInitio+Augustus) 8 genomes. I then took all the transcripts from these annotations to train augustus (BRAKER) and annotate the remaining genomes, with the intent was to mimic AugustusPB. Now I'd like to iterate CAT with 6 of these annotations in order to generate a more homogeneous annotation across all species.

I'd really wish to use AugustusCGP, maybe on the last iteration if you're not too busy and willing to help me...

francicco commented 3 years ago

I got lost too... maybe here https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/244?

F

mhaukness-ucsc commented 3 years ago

and here? #218