TheBrownLab / PhyloFisher

PhyloFisher is a software package written in Python3 that can be used for the creation, analysis, and visualization of phylogenomic datasets that consist of eukaryotic protein sequences.
MIT License

Duplicate and non-trimmed genes in matrix_constructor.py output #112

Closed cha-namoth closed 9 months ago

cha-namoth commented 9 months ago

Hi guys,

I have a weird issue with duplicated genes in the final concatenation after running matrix_constructor.py on a small subset of genes. The genes in matrix.fas also don't appear to be trimmed, and the duplication shows up in indices.tsv as well:

Gene    Start  Stop
BTUB    1      456
BTUB    457    912
GRC5    913    1130
GRC5    1131   1348
NSA2    1349   1610
NSA2    1611   1872
PSMD6   1873   2306
PSMD6   2307   2740
RPL12   2741   2907
RPL12   2908   3074
RPL13A  3075   3298
RPL13A  3299   3522
RPL17   3523   3712
RPL17   3713   3902
RPL19   3903   4113
RPL19   4114   4324
RPL21   4325   4498
RPL21   4499   4672
RPL30   4673   4788
RPL30   4789   4904
RPL7A   4905   5148
RPL7A   5149   5392
RPPO    5393   5715
RPPO    5716   6038
RPS12   6039   6178
RPS12   6179   6318
RPS15   6319   6470
RPS15   6471   6622
RPS17   6623   6759
RPS17   6760   6896
RPS26   6897   7008
RPS26   7009   7120
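
For anyone who wants to check their own run for the same symptom, here is a minimal sketch that lists genes occurring more than once in an indices.tsv-style listing. It is not part of PhyloFisher. It assumes the layout shown above (a "Gene Start Stop" header followed by whitespace- or tab-separated rows), and the function name duplicated_genes is made up for illustration.

```python
from collections import Counter


def duplicated_genes(lines):
    """Return the sorted gene names that appear more than once.

    `lines` is any iterable of text lines laid out like the listing above:
    a "Gene Start Stop" header, then one row per gene block.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()  # handles tabs and runs of spaces alike
        if not fields or fields[0] == "Gene":
            continue  # skip blank lines and the header row
        counts[fields[0]] += 1
    return sorted(gene for gene, n in counts.items() if n > 1)
```

To run it on a real file, pass an open file handle, for example `duplicated_genes(open("indices.tsv"))`. An empty list means no gene name is repeated.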

Background: I previously generated a final concatenation from this dataset with matrix_constructor.py and ran IQTree on it a while ago. There were no duplicated genes and everything was as it should be.

I then noticed that some of the single-gene trees used in this process had a paralogy/contamination issue with one of the taxa, so I went through the single-gene trees again with parasorter and sorted it out (haha). I then ran apply_to_db.py with the updated *_parsed.tsv files on the original fisher.py output. Selecting taxa and orthologs also worked without issues, as did prepping the final dataset (although using the same completeness cutoff with select_orthologs as before now gives me a different, bigger selection of genes). When I run matrix_constructor.py it seems to finish OK, but the output in matrix.fas is a) not trimmed and b) has each gene twice.

The logs for trimal and divvier in the matrix_constructor output are empty for each gene; the ones for prequal and mafft are there and seem fine to me.
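
A quick way to spot empty logs like these is to scan the output folder for zero-byte files. The sketch below does only that. The folder layout and log file names of matrix_constructor.py are not documented in this thread, so matching on the substrings "trimal" and "divvier" is an assumption, and empty_files is a hypothetical helper name.

```python
from pathlib import Path


def empty_files(root, name_hints=("trimal", "divvier")):
    """Return zero-byte files under `root` whose relative path mentions a hint.

    The hints are guesses at how the log files or folders are named;
    adjust them to whatever your output directory actually contains.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        relative = str(path.relative_to(root)).lower()
        if any(hint in relative for hint in name_hints) and path.stat().st_size == 0:
            hits.append(path)
    return sorted(hits)
```

Calling `empty_files("path/to/matrix_constructor_output")` with your own output path returns the matching empty files. A non-empty result for every gene would match the symptom described here.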

I was using version 1.2.6 at the time, and I am sticking to it for version consistency and traceability in the final paper. I ran my single-gene trees via IQTree, not RAxML, in case that matters.

I'm really not sure what's going on. I have rerun everything several times now, with both version 1.2.6 and 1.2.13, and get the same result. Any idea what the issue might be?

I'll happily give you any output files and folders (preferably not publicly, if possible).

Cheers, Gordon

atice commented 9 months ago

Hi Gordon,

Apologies for the inconvenience, as always. Robert and I have a PhyloFisher meeting scheduled for noon on Wednesday; we will put this on the list of talking points and get back to you ASAP.

Alex

From: Gordon Lax
Sent: Monday, January 15, 2024 7:06:58 PM
To: TheBrownLab/PhyloFisher
Cc: Subscribed
Subject: [TheBrownLab/PhyloFisher] Duplicate and non-trimmed genes in matrix_constructor.py output (Issue #112)


cha-namoth commented 9 months ago

Thanks as always, Alex! And no worries, this stuff is bound to happen with something as complex as PhyloFisher.

Cheers, Gordon

atice commented 9 months ago

Hi Gordon,

A while back we caught and fixed the bug in v. 1.2.6 that created the duplicated genes. We have not seen it in v. 1.2.13, and we have made quite a few matrices lately. Could you double-check that you were actually using 1.2.13? Maybe even install it in a fresh environment and confirm whether the issue is still present.

Alex
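
One way to confirm which version is active in an environment is to ask Python's package metadata from inside that environment. This is a minimal sketch, not an official PhyloFisher check. It assumes the package is installed under the distribution name "phylofisher", and the helper names are made up. The tuple comparison is there because comparing version strings as plain text would rank "1.2.6" above "1.2.13".

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of `dist_name`, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def version_tuple(version_string):
    """Turn a purely numeric dotted version such as '1.2.13' into (1, 2, 13)."""
    return tuple(int(part) for part in version_string.split("."))


def is_at_least(found, required):
    """True if version string `found` is numerically >= `required`."""
    return version_tuple(found) >= version_tuple(required)
```

Running `installed_version("phylofisher")` in the active environment (the distribution name is an assumption) and passing the result to `is_at_least(..., "1.2.13")` shows whether the environment really has the newer release.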

cha-namoth commented 9 months ago

Hi Alex,

Ah, it turns out that with version 1.2.13 it actually works! There are no duplicate genes, and they were trimmed prior to concatenation. Sorry about this; apparently I did not use 1.2.13 during troubleshooting.

Thanks again for dealing with this so quickly. I appreciate it!

Cheers, Gordon