fwhelan / coinfinder

A tool for the identification of coincident (associating and dissociating) genes in pangenomes.
GNU General Public License v3.0

Will manually modifying zero-length branches affect the output? #24

Closed limin321 closed 3 years ago

limin321 commented 4 years ago

Hi, thank you for developing this tool. It is really helping me understand my data. I ran coinfinder on 311 bacterial strains. My phylogenetic tree has some zero-length branches, so I manually added 1 to every branch length to avoid this issue. It ran successfully. I want to know whether this will affect the results.
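The adjustment described above can be sketched in Python as follows. This is only an illustration, not part of coinfinder: the tree string is made up (the first two strain names are borrowed from this thread, `strain_3` is hypothetical), and the regular expression assumes a plain Newick file with no quoted labels containing colons.

```python
import re

# A Newick branch length: a colon followed by a non-negative number
# (integer, decimal, or scientific notation).
BRANCH_LENGTH = re.compile(r":(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def count_zero_length_branches(newick: str) -> int:
    """Count branches whose length is exactly zero."""
    return sum(1 for m in BRANCH_LENGTH.finditer(newick) if float(m.group(1)) == 0.0)


def shift_branch_lengths(newick: str, offset: float = 1.0) -> str:
    """Return the tree text with `offset` added to every branch length."""
    def bump(match):
        return ":" + repr(float(match.group(1)) + offset)
    return BRANCH_LENGTH.sub(bump, newick)


# Hypothetical example tree with two zero-length branches.
tree = "((17_2069_2b:0,17_2069_2c:0.0):0.5,strain_3:1.25);"
shifted = shift_branch_lengths(tree)

print(count_zero_length_branches(tree))     # zero-length branches before
print(shifted)
print(count_zero_length_branches(shifted))  # zero-length branches after
```

Adding the same constant to every branch keeps the topology unchanged but alters relative branch lengths, which is why the question about downstream effects is worth asking.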

By the way, I can't find any documentation that helps me understand the output files. Could you please explain them in more detail? There are 2 PDF, 4 TSV, 2 CSV and 1 GEXF files. Is that right? Could you please add more information to the README about how to interpret each output file, such as what kind of information each file contains and, for the GEXF file, which tool opens it (this one I can of course google)?

One output file is called heatmap0.pdf. It should look like this:

Screen Shot 2020-08-09 at 11 33 26 AM

However, when I open the PDF and zoom in to 400%, the label names overlap. Is there a way to make the labels look better? Part of the picture looks like this:

Screen Shot 2020-08-09 at 11 38 42 AM

In the picture above, you can see two strains (17_2069_2c and 17_2069_2b) that are totally blank, with no information at all. How should I interpret this missing information in heatmap0?

At the bottom, all the labels overlap and I can't read them at all. Any suggestions? Here is how it looks:

Screen Shot 2020-08-09 at 11 44 38 AM

Really appreciate your help and your time, Best,

Thanks a lot.

fwhelan commented 4 years ago

Hello, thanks for your interest in coinfinder and for your questions. I'll address each one by one below:

  1. Adjusting zero-length branches in your phylogeny won't affect your results. It might affect the D value (the value of lineage independence, which is based on the phylogeny) slightly, but it will not affect which genes are labelled as associating or dissociating.

  2. Details on what each of the output files contains are provided in Table 1 of the coinfinder publication https://doi.org/10.1099/mgen.0.000338. GEXF files can be opened with the open source tool Gephi (among other options).

  3. Unfortunately some overlap in the labels might happen depending on the size of your phylogeny and output clusters. Here it looks like the cluster of co-occurring genes labelled in purple is quite large; coinfinder by default tries to display all the genes in a cluster as part of the same heatmap, which makes the labels hard to read in this case. You could try altering the network.R code to suit your needs, or plot the data in _pairs.tsv for gene pairs of interest with your own method of choice.

  4. The strains with missing data appear not to have these particular genes within their genomes. You could double check this by looking at the raw data in your gene_presence_absence.csv (or equivalent); if you think this is a bug in coinfinder, please provide me with a reproducible minimal example and I'd be happy to take a look.
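For the double check suggested in point 4, a minimal Python sketch might look like the one below. It is not coinfinder code. It assumes a simplified table in which the first column is named `Gene` and every other column is a strain, with an empty cell meaning the gene is absent; real Roary-style or PIRATE-derived files carry extra metadata columns, and the gene names and locus tags here are invented.

```python
import csv
import io
from typing import List


def genes_present(csv_text: str, strain: str, gene_column: str = "Gene") -> List[str]:
    """List the genes that have a non-empty cell in the given strain's column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or strain not in reader.fieldnames:
        raise KeyError(f"strain {strain!r} is not a column in the table")
    return [row[gene_column] for row in reader if (row[strain] or "").strip()]


# Hypothetical miniature presence/absence table.
example = (
    '"Gene","17_2069_2b","17_2069_2c","strain_3"\n'
    '"geneA","locus_1","","locus_7"\n'
    '"geneB","","","locus_8"\n'
    '"geneC","locus_2","locus_5",""\n'
)

print(genes_present(example, "17_2069_2b"))
print(genes_present(example, "17_2069_2c"))
```

If a strain that is blank in the heatmap returns a non-empty list for the genes in that cluster, the blank row is worth reporting as a possible bug.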

I hope that's helpful! Let me know if you have any further questions. --Fiona

fwhelan commented 4 years ago

Closing due to inactivity.

limin321 commented 4 years ago

Hi Fiona

I am so sorry for the late reply. I noticed that my gene_presence_absence.csv had some issues, and I have been troubleshooting all this time. I have now used the correct gene_presence_absence.csv and got the same confusing heatmap1 that I mentioned a long time ago. I appreciate that you would be willing to test my data. May I have your email so I can send you a subset of my data? Thanks.

By the way, I have another question. I used PIRATE to run the pangenome analysis, and one of the PIRATE output files is described as: pangenome.gfa - GFA network file representing all unique connections between gene families (extracted from the GFF files). Can be loaded and visualised in Bandage. coinfinder also generates a lot of associations between genes. Do you know what the difference is between the unique connections between gene families from PIRATE and the associated gene groups from coinfinder?

Thank you so much for all your help. Really appreciate that. Best, and stay safe. Limin

fwhelan commented 4 years ago

Can you please explain what you mean when you say you get a confusing heatmap?

I did a quick search of the PIRATE manuscript and github and can't find a more detailed explanation of the pangenome.gfa file. You might ask the developers of PIRATE for more information; right now, I can't say much of how coinfinder compares.

limin321 commented 4 years ago

Hi, Thank you for your quick response.

Screen Shot 2020-09-16 at 10 35 37 PM

As shown above, there are some strains that are not shown in the heatmap. The whole line is just blank. This is the same issue I posted in my first question a long time ago, which you said you could help me test. I just don't understand why, in both heatmap PDFs, some strains do not show any coloured bars even though these genes are present in all strains.

Thanks. Best, Limin

limin321 commented 4 years ago

Hi, Dr. Fiona

I found something interesting. In my coinfinder output A_components.tsv, there is one associated pair: g005352_000002__g005352,g004800_000002__g004800

I then searched the PIRATE output pangenome.gfa, and the same two gene families are also linked with each other:

L g005352 + g004800 + 0M

However, g005352 also links with many other gene families at the same time.

The above is just one example; I am sure that if I compare more pairs, I may find more similar results. Do you have any thoughts about this?
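A comparison like the one described here could be automated roughly as follows. This Python sketch rests on two assumptions that are not confirmed anywhere in this thread: that the gene family ID in a coinfinder name is the text after the final `__`, and that the `.gfa` file follows the usual GFA layout in which `L` lines list the two linked segments in the second and fourth fields. The `g000001` family is invented for the example.

```python
def parse_gfa_links(gfa_text):
    """Collect undirected links from the L lines of a GFA file."""
    links = set()
    for line in gfa_text.splitlines():
        fields = line.split()
        # L <from> <from_orient> <to> <to_orient> <overlap>
        if len(fields) >= 5 and fields[0] == "L":
            links.add(frozenset((fields[1], fields[3])))
    return links


def family_id(name):
    """Assumption: the family ID is whatever follows the final '__'."""
    return name.rsplit("__", 1)[-1]


def pair_is_linked(gene_a, gene_b, links):
    return frozenset((family_id(gene_a), family_id(gene_b))) in links


def neighbours(family, links):
    """All families directly linked to `family`."""
    return sorted(other for link in links if family in link for other in link - {family})


# Hypothetical GFA content built around the link quoted above.
gfa = (
    "S\tg005352\t*\n"
    "S\tg004800\t*\n"
    "S\tg000001\t*\n"
    "L\tg005352\t+\tg004800\t+\t0M\n"
    "L\tg005352\t+\tg000001\t-\t0M\n"
)
links = parse_gfa_links(gfa)

print(pair_is_linked("g005352_000002__g005352", "g004800_000002__g004800", links))
print(neighbours("g005352", links))
```

Running this over every pair in A_components.tsv would show how many coinfinder associations coincide with a direct link in the graph.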

I also posted a similar question here: https://github.com/GFA-spec/GFA-spec/issues/101 Hopefully I will get an answer on how to interpret *.gfa files.

Thanks, Best, Limin

fwhelan commented 4 years ago

Ah okay. If these strains do have these genes in the gene_presence_absence.csv file then it might be an issue with special characters in the genome names. Could you please provide an example here of the name of a strain in the phylogeny (exactly as it appears in the heatmap output) and the beginning of the appropriate line corresponding to the same strain in the gene_presence_absence.csv file? I'm wondering if a special character is causing these to not match and thus no genes to be displayed in the final output. You can paste these here into a comment.
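One way to look for such a mismatch is to compare the leaf names in the tree with the strain columns in the table header. The Python sketch below is an illustration under stated assumptions, not coinfinder's own matching logic: it handles only unquoted Newick labels, it treats the first `n_meta` header columns as non-strain columns (1 here; Roary-style files have more), and the tree and header are invented, with a trailing space planted in one name.

```python
import csv
import re


def leaf_names(newick):
    """Leaf labels follow '(' or ','; labels after ')' are internal nodes."""
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)


def compare_names(newick, header_line, n_meta=1):
    """Return (names only in the tree, names only in the table header)."""
    tree_names = set(leaf_names(newick))
    table_names = set(next(csv.reader([header_line]))[n_meta:])
    return sorted(tree_names - table_names), sorted(table_names - tree_names)


# Hypothetical inputs; note the trailing space inside the second strain name.
tree = "((17_2069_2b:0.1,17_2069_2c:0.1):0.5,strain_3:1.25);"
header = '"Gene","17_2069_2b","17_2069_2c ","strain_3"'

only_in_tree, only_in_table = compare_names(tree, header)
# repr() makes invisible characters such as trailing spaces visible.
print([repr(n) for n in only_in_tree])
print([repr(n) for n in only_in_table])
```

Two empty lists mean every name matches exactly; anything else points at the names worth pasting into a comment.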

In terms of the .gfa, I think it would be more appropriate to post your question to the PIRATE team than to GFA-spec.

limin321 commented 4 years ago

Hi, thank you so much. Since my data is going to be published one day, may I have your email so I can send you a subset of gene_presence_absence.csv that includes the exact strain names? The table is too big to copy here.

Thanks,

fwhelan commented 4 years ago

I really just need the first 5 or so columns of the appropriate line, but my email is on my github profile if you don't want to paste the data here.

limin321 commented 4 years ago

Hi, I sent test data to this email: fiona.whelan@nottingham.ac.uk Please let me know if you don't get it. I attached the strains that do not show up in the heatmap.

Thank you so much.

Best, limin

fwhelan commented 4 years ago

It looks like in your gene_p_a file you have strains such as X17_2069_2b which appear in the heatmap output as 17_2069_2b (the leading X missing). Is the leading X present in the phylogeny input file?

(I can't actually run the file you provided through coinfinder, as not all fields are in quoted CSV format.)
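If a table has lost its quoting (for example after being opened and re-saved in another program), it can be rewritten with every field quoted. The following Python sketch uses only the standard `csv` module; the sample rows are invented, and it is an assumption that uniformly quoted fields are all that is needed for the file to run.

```python
import csv
import io


def requote_csv(text: str) -> str:
    """Rewrite CSV text so that every field, including empty ones, is quoted."""
    rows = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical table with inconsistent quoting.
mixed = 'Gene,17_2069_2b,"17_2069_2c"\ngeneA,locus_1,\n'
print(requote_csv(mixed))
```

For a real file, read it with `open(path, newline="")`, pass the contents through `requote_csv`, and write the result to a new file rather than overwriting the original.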

limin321 commented 4 years ago

Hi Fiona

Sorry for including the messy information in the test file. The file I actually used for running coinfinder doesn't have 'X' in front of those names. The test file had 'X' because I opened it in R and forgot that I had modified it.

Here is a screenshot of part of the gene presence/absence table I used for running coinfinder.

Screen Shot 2020-09-28 at 4 28 32 PM

My phylogeny has the same labels as the gene_p_a file, because I created the phylogeny from the PIRATE core gene alignment. Since both the phylogeny and gene_p_a come from PIRATE, the labels in the two files match one another.

Please let me know if you need any more info.

Best, Limin

fwhelan commented 4 years ago

Hmm ok, thanks Limin. Would you be able to provide me with a gene_p_a and a phylogeny that I could use to run coinfinder and reproduce the issue? Thanks.

limin321 commented 4 years ago

Hi, Thank you so much.

Please give me one day to subset my tree, or to find out whether my PI will give me permission to send you my whole dataset. I will email you once I have talked to my PI. Thanks a lot for trying to help.

Best, Limin

fwhelan commented 4 years ago

No rush at all. The other option would be to provide me with a subset of the files, as you did before, but ensuring that they (a) run through coinfinder and (b) reproduce the bug in the heatmaps that you're seeing above. Thanks.

limin321 commented 4 years ago

Hi, I just emailed you my data, including one tree and gene presence absence data. Please let me know if you don't get it.

Best, Limin

fwhelan commented 3 years ago

Hi Limin, thanks so much for providing your data in order to help me find and fix this bug. The bug was related to data not being displayed properly in the heatmap output if the genome name began with a numerical character. It didn't affect the raw output files but did affect the heatmap visuals.
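To see which genomes in a dataset fall into this category, a quick check along these lines may help. This is only a Python sketch for screening names, not the fix itself, and the name list is illustrative (the first two names come from this thread, `strain_3` is made up).

```python
import re


def names_starting_with_digit(names):
    """Return the genome names whose first character is 0-9."""
    return [name for name in names if re.match(r"[0-9]", name)]


# Illustrative genome names.
genomes = ["17_2069_2b", "17_2069_2c", "strain_3"]
print(names_starting_with_digit(genomes))
```

Any names it returns are the ones whose heatmap rows are worth re-checking after upgrading.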

Thanks to your help, I've fixed this issue with commit fa8e340. I will now release a new version of coinfinder, v1.0.7, to bioconda; it should be live within 24 hrs (depending on how busy the bioconda site is).

Thanks for your patience and your help with this, Limin!

fwhelan commented 3 years ago

v1.0.7 is available on bioconda now. Please let me know how you get on with it and if you still have any issues.

limin321 commented 3 years ago

Hi Fiona,

Thank you so much for helping with the troubleshooting. I will try again with the new version. My server is down these days, so it may take a while to update the version and run it again.

Best, Limin

fwhelan commented 3 years ago

I'm going to close this issue but please feel free to reopen if you're still having troubles.