Closed cha-namoth closed 9 months ago
Hi Gordon,
Apologies for the inconvenience, as always. Robert and I have a PhyloFisher meeting scheduled for noon Wednesday; we will put this on the list of talking points and get back to you ASAP.
Alex
From: Gordon Lax
Sent: Monday, January 15, 2024 7:06:58 PM
To: TheBrownLab/PhyloFisher
Subject: [TheBrownLab/PhyloFisher] Duplicate and non-trimmed genes in matrix_constructor.py output (Issue #112)
Hi guys,
I have a weird issue with duplicated genes in the final concatenation after running matrix_constructor.py, using a small subset of genes. Looking at matrix.fas they also don't appear to be trimmed, and the duplication is also documented in indices.tsv:
Gene    Start  Stop
BTUB    1      456
BTUB    457    912
GRC5    913    1130
GRC5    1131   1348
NSA2    1349   1610
NSA2    1611   1872
PSMD6   1873   2306
PSMD6   2307   2740
RPL12   2741   2907
RPL12   2908   3074
RPL13A  3075   3298
RPL13A  3299   3522
RPL17   3523   3712
RPL17   3713   3902
RPL19   3903   4113
RPL19   4114   4324
RPL21   4325   4498
RPL21   4499   4672
RPL30   4673   4788
RPL30   4789   4904
RPL7A   4905   5148
RPL7A   5149   5392
RPPO    5393   5715
RPPO    5716   6038
RPS12   6039   6178
RPS12   6179   6318
RPS15   6319   6470
RPS15   6471   6622
RPS17   6623   6759
RPS17   6760   6896
RPS26   6897   7008
RPS26   7009   7120
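For anyone who wants to check their own indices.tsv for the same symptom, here is a minimal sketch. It assumes the file is tab-separated with a `Gene` header column, as the listing above suggests; the function name `duplicated_genes` is made up for illustration and is not part of PhyloFisher.

```python
import csv
import io
from collections import Counter


def duplicated_genes(tsv_text):
    """Return gene names that occur more than once, in first-seen order.

    Assumes tab-separated text with a header row containing a 'Gene' column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(row["Gene"] for row in reader)
    return [gene for gene, n in counts.items() if n > 1]


# First four rows of the indices.tsv listing from this report.
sample = (
    "Gene\tStart\tStop\n"
    "BTUB\t1\t456\n"
    "BTUB\t457\t912\n"
    "GRC5\t913\t1130\n"
    "GRC5\t1131\t1348\n"
)
print(duplicated_genes(sample))  # ['BTUB', 'GRC5']
```

To run it on a real file, pass the file contents, for example `duplicated_genes(open("indices.tsv").read())`. An empty list means no gene name is repeated.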
Background is: I have a dataset from which I generated a final concatenation before with matrix_constructor.py, and ran IQTree on that a while ago. No duplicated genes and everything is as it should be.
I then noticed that some of my single-gene trees used in this process had a paralogy/contamination issue with one of the taxa, so I went through the single-gene trees again with parasorter and sorted it out (haha). I then used apply_to_db.py with the updated *_parsed.tsv files using the original fisher.py output. Selecting taxa and orthologs also worked without issues, as did prepping the final dataset (although using the same completeness-cutoff with select_orthologs as before now gives me a different, bigger selection of genes). When I run matrix_constructor.py it seems to finish ok, but the output in matrix.fas is a) not trimmed and b) has each gene twice.
The logs for trimal and divvier in the matrix_constructor output are empty for each gene, the ones for prequal and mafft are there and seem fine to me.
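A quick way to list empty log files is sketched below. The file names and the `*.log` pattern are hypothetical, since the actual layout of the matrix_constructor output directory is not shown in this thread; adjust both to match what is on disk.

```python
import tempfile
from pathlib import Path


def empty_files(directory, pattern="*"):
    """Return sorted paths of zero-byte files matching pattern under directory."""
    return sorted(
        p for p in Path(directory).rglob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


# Self-contained demo with made-up file names (not real PhyloFisher output).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "trimal_BTUB.log").write_text("")               # empty log
    (Path(tmp) / "mafft_BTUB.log").write_text("some output\n")   # non-empty log
    demo = [p.name for p in empty_files(tmp, "*.log")]

print(demo)  # ['trimal_BTUB.log']
```

Empty logs for a step only show that the step wrote nothing to them; on their own they do not prove the step was skipped.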
I was using version 1.2.6 at the time and that is what I am currently sticking to as well for version consistency and traceability in the final paper. I ran my single-gene trees via IQTree, not RAxML, in case that matters.
I'm really not sure what's going on – I reran everything several times now, including with version 1.2.6 and 1.2.13, same result. Any idea of what the issue might be?
I'll happily give you any output files and folders (prefer not public if possible).
Cheers, Gordon
Thanks as always Alex! And no worries – this stuff is bound to happen with something as complex as Phylofisher.
Cheers, Gordon
Hi Gordon,
We caught and cured the bug that created the duplicated genes a while back in v. 1.2.6. We have not noticed it still being a problem in v. 1.2.13, and we have made quite a few matrices lately. Will you just double check that you were actually using 1.2.13? Maybe even install it in a fresh environment and confirm whether the issue is still present?
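One way to confirm which version is active in a given environment is to ask Python directly, as in the sketch below. The distribution name `phylofisher` is an assumption about how the package is registered; if it is installed under a different name, the function returns `None`.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if not found."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Show which interpreter (and therefore which environment) is running.
print(sys.executable)
# 'phylofisher' is an assumed distribution name, not confirmed in this thread.
print(installed_version("phylofisher"))
```

Running this with the same interpreter that launches matrix_constructor.py shows which install is actually being picked up.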
Alex
Hi Alex,
Ah turns out if I use version 1.2.13 it actually works now! No duplicate genes and they were trimmed prior to concatenation. Sorry about this, apparently I did not use 1.2.13 during troubleshooting.
Thanks again for dealing with this so quickly, I appreciate it!
Cheers, Gordon