faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

phyluce_align_get_only_loci_with_min_taxa gives fewer loci than expected #201

Closed claudiavaga closed 3 years ago

claudiavaga commented 3 years ago

I am trying to run phyluce_align_get_only_loci_with_min_taxa with percent 0.65.

The result from phyluce_align_get_align_summary_data was this:

2020-09-16 11:09:15,281 - phyluce_align_get_align_summary_data - INFO - ========= Starting phyluce_align_get_align_summary_data =========
2020-09-16 11:09:15,281 - phyluce_align_get_align_summary_data - INFO - Version: git fatal: not a git repository: '/home/geninfo/cvaga/.conda/envs/phyluce/lib/python2.7/site-packages/.git'
2020-09-16 11:09:15,281 - phyluce_align_get_align_summary_data - INFO - Argument --alignments: /home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed
2020-09-16 11:09:15,281 - phyluce_align_get_align_summary_data - INFO - Argument --cores: 20
2020-09-16 11:09:15,282 - phyluce_align_get_align_summary_data - INFO - Argument --input_format: nexus
2020-09-16 11:09:15,282 - phyluce_align_get_align_summary_data - INFO - Argument --log_path: /home/geninfo/cvaga/taxon-sets/all/log
2020-09-16 11:09:15,282 - phyluce_align_get_align_summary_data - INFO - Argument --output: None
2020-09-16 11:09:15,282 - phyluce_align_get_align_summary_data - INFO - Argument --show_taxon_counts: False
2020-09-16 11:09:15,282 - phyluce_align_get_align_summary_data - INFO - Argument --verbosity: INFO
2020-09-16 11:09:15,283 - phyluce_align_get_align_summary_data - INFO - Getting alignment files
2020-09-16 11:09:15,294 - phyluce_align_get_align_summary_data - INFO - Computing summary statistics using 20 cores
2020-09-16 11:09:36,694 - phyluce_align_get_align_summary_data - INFO - ----------------------- Alignment summary -----------------------
2020-09-16 11:09:36,695 - phyluce_align_get_align_summary_data - INFO - [Alignments] loci: 1,660
2020-09-16 11:09:36,695 - phyluce_align_get_align_summary_data - INFO - [Alignments] length: 769,724
2020-09-16 11:09:36,695 - phyluce_align_get_align_summary_data - INFO - [Alignments] mean: 463.69
2020-09-16 11:09:36,695 - phyluce_align_get_align_summary_data - INFO - [Alignments] 95% CI: 138.78
2020-09-16 11:09:36,695 - phyluce_align_get_align_summary_data - INFO - [Alignments] min: 100
2020-09-16 11:09:36,695 - phyluce_align_get_align_summary_data - INFO - [Alignments] max: 95,816
2020-09-16 11:09:36,696 - phyluce_align_get_align_summary_data - INFO - ------------------- Informative Sites summary -------------------
2020-09-16 11:09:36,697 - phyluce_align_get_align_summary_data - INFO - [Sites] loci: 1,660
2020-09-16 11:09:36,697 - phyluce_align_get_align_summary_data - INFO - [Sites] total: 125,105
2020-09-16 11:09:36,697 - phyluce_align_get_align_summary_data - INFO - [Sites] mean: 75.36
2020-09-16 11:09:36,697 - phyluce_align_get_align_summary_data - INFO - [Sites] 95% CI: 2.47
2020-09-16 11:09:36,697 - phyluce_align_get_align_summary_data - INFO - [Sites] min: 0
2020-09-16 11:09:36,697 - phyluce_align_get_align_summary_data - INFO - [Sites] max: 569
2020-09-16 11:09:36,700 - phyluce_align_get_align_summary_data - INFO - ------------------------- Taxon summary -------------------------
2020-09-16 11:09:36,700 - phyluce_align_get_align_summary_data - INFO - [Taxa] mean: 39.32
2020-09-16 11:09:36,700 - phyluce_align_get_align_summary_data - INFO - [Taxa] 95% CI: 0.51
2020-09-16 11:09:36,700 - phyluce_align_get_align_summary_data - INFO - [Taxa] min: 3
2020-09-16 11:09:36,700 - phyluce_align_get_align_summary_data - INFO - [Taxa] max: 63
2020-09-16 11:09:36,701 - phyluce_align_get_align_summary_data - INFO - ----------------- Missing data from trim summary ----------------
2020-09-16 11:09:36,701 - phyluce_align_get_align_summary_data - INFO - [Missing] mean: 9.59
2020-09-16 11:09:36,702 - phyluce_align_get_align_summary_data - INFO - [Missing] 95% CI: 0.26
2020-09-16 11:09:36,702 - phyluce_align_get_align_summary_data - INFO - [Missing] min: 0.00
2020-09-16 11:09:36,702 - phyluce_align_get_align_summary_data - INFO - [Missing] max: 34.48
2020-09-16 11:09:36,723 - phyluce_align_get_align_summary_data - INFO - -------------------- Character count summary --------------------
2020-09-16 11:09:36,723 - phyluce_align_get_align_summary_data - INFO - [All characters] 25,291,808
2020-09-16 11:09:36,723 - phyluce_align_get_align_summary_data - INFO - [Nucleotides] 13,731,806
2020-09-16 11:09:36,725 - phyluce_align_get_align_summary_data - INFO - ---------------- Data matrix completeness summary ---------------
2020-09-16 11:09:36,725 - phyluce_align_get_align_summary_data - INFO - [Matrix 50%] 1437 alignments
2020-09-16 11:09:36,725 - phyluce_align_get_align_summary_data - INFO - [Matrix 55%] 1310 alignments
2020-09-16 11:09:36,726 - phyluce_align_get_align_summary_data - INFO - [Matrix 60%] 1121 alignments
2020-09-16 11:09:36,726 - phyluce_align_get_align_summary_data - INFO - [Matrix 65%] 858 alignments
2020-09-16 11:09:36,726 - phyluce_align_get_align_summary_data - INFO - [Matrix 70%] 608 alignments
2020-09-16 11:09:36,726 - phyluce_align_get_align_summary_data - INFO - [Matrix 75%] 370 alignments
2020-09-16 11:09:36,726 - phyluce_align_get_align_summary_data - INFO - [Matrix 80%] 197 alignments
2020-09-16 11:09:36,726 - phyluce_align_get_align_summary_data - INFO - [Matrix 85%] 90 alignments
2020-09-16 11:09:36,727 - phyluce_align_get_align_summary_data - INFO - [Matrix 90%] 28 alignments
2020-09-16 11:09:36,727 - phyluce_align_get_align_summary_data - INFO - [Matrix 95%] 1 alignments
2020-09-16 11:09:36,727 - phyluce_align_get_align_summary_data - INFO - ------------------------ Character counts -----------------------
2020-09-16 11:09:36,727 - phyluce_align_get_align_summary_data - INFO - [Characters] '-' is present 9,011,339 times
2020-09-16 11:09:36,727 - phyluce_align_get_align_summary_data - INFO - [Characters] '?' is present 2,548,663 times
2020-09-16 11:09:36,728 - phyluce_align_get_align_summary_data - INFO - [Characters] 'A' is present 3,947,580 times
2020-09-16 11:09:36,728 - phyluce_align_get_align_summary_data - INFO - [Characters] 'C' is present 2,876,041 times
2020-09-16 11:09:36,728 - phyluce_align_get_align_summary_data - INFO - [Characters] 'G' is present 3,119,084 times
2020-09-16 11:09:36,728 - phyluce_align_get_align_summary_data - INFO - [Characters] 'T' is present 3,789,101 times
2020-09-16 11:09:36,728 - phyluce_align_get_align_summary_data - INFO - ========= Completed phyluce_align_get_align_summary_data ========

I then ran this:

phyluce_align_remove_locus_name_from_nexus_lines \
--alignments mafft-nexus-edge-trimmed \
--output mafft-nexus-edge-trimmed-gblocks-clean \
--cores 20 \
--log-path log

with this result:

2020-09-16 11:26:55,208 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - === Starting phyluce_align_remove_locus_name_from_nexus_lines ===
2020-09-16 11:26:55,208 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Version: git fatal: not a git repository: '/home/geninfo/cvaga/.conda/envs/phyluce/lib/python2.7/site-packages/.git'
2020-09-16 11:26:55,209 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --alignments: /home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed
2020-09-16 11:26:55,209 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --cores: 20
2020-09-16 11:26:55,209 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --input_format: nexus
2020-09-16 11:26:55,209 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --log_path: /home/geninfo/cvaga/taxon-sets/all/log
2020-09-16 11:26:55,209 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --output: /home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed-gblocks-clean
2020-09-16 11:26:55,209 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --output_format: nexus
2020-09-16 11:26:55,210 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --taxa: None
2020-09-16 11:26:55,210 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Argument --verbosity: INFO
2020-09-16 11:26:55,210 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Getting alignment files
Running............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ 
2020-09-16 11:27:02,724 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - Taxon names in alignments: REL276,REL275,REL271,DNA08,REL162,REL163,REL160,REL167,REL165,REL378,REL202,REL201,REL206,IGS142,IGS143,IGS140,REL261,REL260,REL302,REL304,REL170,DNA32,IGS011,REL079,REL070,REL072,REL077,REL412,DNA25,REL311,REL315,REL418,REL067,REL066,IGS009,IGS008,REL063,REL141,REL069,Muss,REL400,REL243,REL242,REL087,CV10,CV13,CV15,CV16,CV17,IGS071,DNA45,Phyll,DNA41,IGS107,IGS108,IGS109,REL324,CV08,REL122,REL180,REL181,REL184,REL426,REL425,REL424,IGS138,IGS036,Madracisdecactis,IGS110,IGS113,IGS112,IGS115,IGS119,IGS118,REL134,REL130,REL197,REL194,DNA62,DNA63,IGS128,IGS125,IGS121,DNA13,DNA15,REL116,IGS031,REL298,REL112,IGS137
2020-09-16 11:27:02,724 - phyluce_align_remove_locus_name_from_nexus_lines - INFO - === Completed phyluce_align_remove_locus_name_from_nexus_lines ==

But then, when I run this, it only recovers 14 loci:

phyluce_align_get_only_loci_with_min_taxa \
--alignments mafft-nexus-edge-trimmed-gblocks-clean \
--taxa 90 \
--percent 0.65 \
--output mafft-nexus-edge-trimmed-gblocks-clean-65p \
--cores 20 \
--log-path log

2020-09-16 11:48:47,540 - phyluce_align_get_only_loci_with_min_taxa - INFO - ======= Starting phyluce_align_get_only_loci_with_min_taxa ======
2020-09-16 11:48:47,540 - phyluce_align_get_only_loci_with_min_taxa - INFO - Version: git fatal: not a git repository: '/home/geninfo/cvaga/.conda/envs/phyluce/lib/python2.7/site-packages/.git'
2020-09-16 11:48:47,540 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --alignments: /home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed-gblocks-clean
2020-09-16 11:48:47,540 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --cores: 20
2020-09-16 11:48:47,541 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --input_format: nexus
2020-09-16 11:48:47,541 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --log_path: /home/geninfo/cvaga/taxon-sets/all/log
2020-09-16 11:48:47,541 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --output: /home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed-gblocks-clean-65p
2020-09-16 11:48:47,541 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --percent: 0.65
2020-09-16 11:48:47,541 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --taxa: 90
2020-09-16 11:48:47,541 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --verbosity: INFO
2020-09-16 11:48:47,542 - phyluce_align_get_only_loci_with_min_taxa - INFO - Getting alignment files
2020-09-16 11:48:52,552 - phyluce_align_get_only_loci_with_min_taxa - INFO - Copied 14 alignments of 1660 total containing ≥ 0.65 proportion of taxa (n = 58)
2020-09-16 11:48:52,552 - phyluce_align_get_only_loci_with_min_taxa - INFO - ====== Completed phyluce_align_get_only_loci_with_min_taxa ======

Do you know what is happening?

Thank you, Claudia

brantfaircloth commented 3 years ago

hmm. that's weird. you should definitely be getting close to 858 loci. I can't tell what's going on without looking at the data, which I'm happy to do if you zip it up and send it to me (use this link):

https://www.dropbox.com/request/jrBY49nN5cmYyZKrndbO

Please send the first directory of alignments (/home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed) - i'll run the same steps you did.

If you do not want to do that, you'll need to take a look at the contents of alignments that were filtered out when you ran phyluce_align_get_only_loci_with_min_taxa - is there something about the filtered loci that explains what happened?

mateusf commented 3 years ago

Hi Claudia,

Your phyluce_align_get_align_summary_data output says you have a maximum of only 63 individuals (taxa) in any of your matrices, so when you set taxa to 90 in phyluce_align_get_only_loci_with_min_taxa, you only keep the alignments that contain at least 65% of 90 individuals (n = 58 in your log).

That's why you're getting fewer loci in the end.

All the best,

--

Mateus Ferreira



brantfaircloth commented 3 years ago

Ah, yes, Mateus is correct. You have a TOTAL of 90 taxa in your alignments:

Taxon names in alignments: REL276,REL275,REL271,DNA08,REL162,REL163,REL160,REL167,REL165,REL378,REL202,REL201,REL206,IGS142,IGS143,IGS140,REL261,REL260,REL302,REL304,REL170,DNA32,IGS011,REL079,REL070,REL072,REL077,REL412,DNA25,REL311,REL315,REL418,REL067,REL066,IGS009,IGS008,REL063,REL141,REL069,Muss,REL400,REL243,REL242,REL087,CV10,CV13,CV15,CV16,CV17,IGS071,DNA45,Phyll,DNA41,IGS107,IGS108,IGS109,REL324,CV08,REL122,REL180,REL181,REL184,REL426,REL425,REL424,IGS138,IGS036,Madracisdecactis,IGS110,IGS113,IGS112,IGS115,IGS119,IGS118,REL134,REL130,REL197,REL194,DNA62,DNA63,IGS128,IGS125,IGS121,DNA13,DNA15,REL116,IGS031,REL298,REL112,IGS137

But I am guessing you have no single alignment with all 90 taxa (the summary reports a max of 63). So, you're only getting alignments with at least ~65% of 90 taxa (~58 individuals), which is very few.
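The arithmetic behind this can be sketched in a few lines of Python. This is a minimal illustration, not phyluce's actual code: the function name `min_taxa_required` is hypothetical, and rounding down is an assumption inferred from the `n = 58` reported in the log above for `--taxa 90 --percent 0.65`.

```python
import math


def min_taxa_required(total_taxa: int, percent: float) -> int:
    """Smallest number of taxa an alignment needs in order to be kept.

    Assumption: the cutoff is floor(percent * total_taxa), which is
    consistent with the n value printed in the log, but is not taken
    from the phyluce source.
    """
    return math.floor(percent * total_taxa)


# --taxa 90: the cutoff sits just below the largest alignment (63 taxa),
# so very few alignments can pass.
print(min_taxa_required(90, 0.65))  # 58

# --taxa 63 (the observed maximum): the cutoff is much lower.
print(min_taxa_required(63, 0.65))  # 40
```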

mateusf commented 3 years ago

You can use the flag --show_taxon_counts with phyluce_align_get_align_summary_data to see how the individuals are distributed across your alignments.
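For readers who want to see what such a distribution tells you, here is a small stand-alone sketch. It does not call phyluce or reproduce its output format; the locus names and counts are made up, and both helper functions are hypothetical.

```python
from collections import Counter


def taxon_count_distribution(taxa_per_alignment):
    """Map 'number of taxa' to 'number of alignments with that many taxa'."""
    return Counter(taxa_per_alignment.values())


def alignments_meeting(taxa_per_alignment, min_taxa):
    """Names of alignments that contain at least min_taxa taxa."""
    return sorted(
        name for name, n in taxa_per_alignment.items() if n >= min_taxa
    )


# Hypothetical data: locus name -> number of taxa present in that alignment.
counts = {"uce-1": 63, "uce-2": 58, "uce-3": 40, "uce-4": 40, "uce-5": 3}

dist = taxon_count_distribution(counts)
for n_taxa in sorted(dist, reverse=True):
    print(f"{n_taxa} taxa: {dist[n_taxa]} alignment(s)")

# The largest alignment bounds what any cutoff can return.
print("max taxa in any alignment:", max(counts.values()))
print("kept at cutoff 58:", alignments_meeting(counts, 58))
```

Looking at the distribution before choosing a cutoff shows immediately whether a threshold sits near the top of the range, where only a handful of alignments remain.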

claudiavaga commented 3 years ago

Hi Brant and Mateus,

thank you so much for your help! (and super quick answer)

I ran the script again like this:

phyluce_align_get_only_loci_with_min_taxa \
--alignments mafft-nexus-edge-trimmed-gblocks-clean \
--taxa 63 \
--percent 0.65 \
--output mafft-nexus-edge-trimmed-gblocks-clean-65p \
--cores 20 \
--log-path log

And I got this result:

2020-09-16 12:31:10,098 - phyluce_align_get_only_loci_with_min_taxa - INFO - ======= Starting phyluce_align_get_only_loci_with_min_taxa ======
2020-09-16 12:31:10,098 - phyluce_align_get_only_loci_with_min_taxa - INFO - Version: git fatal: not a git repository: '/home/geninfo/cvaga/.conda/envs/phyluce/lib/python2.7/site-packages/.git'
2020-09-16 12:31:10,098 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --alignments: /home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed-gblocks-clean
2020-09-16 12:31:10,098 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --cores: 20
2020-09-16 12:31:10,099 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --input_format: nexus
2020-09-16 12:31:10,099 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --log_path: /home/geninfo/cvaga/taxon-sets/all/log
2020-09-16 12:31:10,099 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --output: /home/geninfo/cvaga/taxon-sets/all/mafft-nexus-edge-trimmed-gblocks-clean-65p
2020-09-16 12:31:10,099 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --percent: 0.65
2020-09-16 12:31:10,099 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --taxa: 63
2020-09-16 12:31:10,099 - phyluce_align_get_only_loci_with_min_taxa - INFO - Argument --verbosity: INFO
2020-09-16 12:31:10,100 - phyluce_align_get_only_loci_with_min_taxa - INFO - Getting alignment files
2020-09-16 12:31:15,015 - phyluce_align_get_only_loci_with_min_taxa - INFO - Copied 952 alignments of 1660 total containing ≥ 0.65 proportion of taxa (n = 40)
2020-09-16 12:31:15,015 - phyluce_align_get_only_loci_with_min_taxa - INFO - ====== Completed phyluce_align_get_only_loci_with_min_taxa ======

Which makes way more sense.

Thank you again, Claudia

AlesBucek commented 1 year ago

Hi, I came across the same thing as the OP. Is it the intended behavior that completeness is calculated not from the total number of taxa but from the number of taxa in the alignment that has the most taxa? Calculating completeness from the maximum number of taxa in any alignment does not seem like the intuitive way to interpret "completeness", and it also differs from the tutorial's description of "completeness":
[...] where “completeness” for the 75% matrix means that, in a study of 100 taxa (total), all alignments will contain at least 75 of these 100 taxa [...] (https://phyluce.readthedocs.io/en/latest/tutorials/tutorial-1.html)

brantfaircloth commented 1 year ago

The "total" taxa argument is up to the user to determine and input correctly. Let me give you an example scenario - you perform enrichments, assembly, alignment, and you think you have 100 taxa in your alignments. But, in reality, you have only a maximum of 80 taxa in your alignments. If you run the code assuming you want (and are going to get) 75% of the 100 taxa you think you have, you are going to be surprised to find that the numbers don't make sense because the maximum # of taxa you have in any alignment is 80 rather than 100. As a result, many alignments will be dropped because they have fewer than 75 taxa.
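The scenario above can be made concrete with a short sketch. The taxon counts below are invented for illustration, the function name `count_kept` is hypothetical, and rounding the cutoff down is an assumption rather than something confirmed from the phyluce source.

```python
import math


def count_kept(taxon_counts, total_taxa, percent):
    """Number of alignments with at least floor(percent * total_taxa) taxa.

    Assumption: the cutoff is rounded down; this mirrors the idea in the
    comment above, not a verified phyluce implementation.
    """
    cutoff = math.floor(percent * total_taxa)
    return sum(1 for n in taxon_counts if n >= cutoff)


# Hypothetical taxa-per-alignment counts; no alignment exceeds 80 taxa.
taxon_counts = [80, 78, 76, 74, 70, 62, 60, 55]

# Believing there are 100 taxa: cutoff is 75, so most alignments are dropped.
print(count_kept(taxon_counts, 100, 0.75))  # 3

# Using 80 as the total: cutoff is 60, so far more alignments are kept.
print(count_kept(taxon_counts, 80, 0.75))  # 7
```

The same alignments give very different "75% complete" matrices depending on which total is supplied, which is why the value passed as the total matters so much.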

AlesBucek commented 1 year ago

Thanks for the example. I can see the benefit of defining "completeness" this way. I just did not know that this is how completeness is defined in Phyluce. When I'm extracting loci with an above-threshold number of taxa, I always use the number of all taxa in the dataset. The output of phyluce_align_get_align_summary_data was actually the first time I realized "completeness" could be defined based on the tallest alignment (the one with the most taxa). It might be worth documenting what Phyluce means by "completeness" (unless I missed it).

brantfaircloth commented 1 year ago

The way that I tend to deal with this is to always run phyluce_align_get_align_summary_data before I do much of anything else to make sure that my alignments include all the taxa that I think that they do (which enables me to input the correct, total number of taxa). I'll try to think of an easy/clear way to add a note to the documentation.
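One way to automate that habit is to read the `[Taxa] max:` line out of the summary log and compare it with the total you intend to supply. The sketch below assumes the log line format shown earlier in this thread; the helper names are hypothetical and are not part of phyluce.

```python
import re

# Matches e.g. "... - INFO - [Taxa] max: 63" (format taken from the log above).
TAXA_MAX = re.compile(r"\[Taxa\] max:\s*([\d,]+)")


def max_taxa_from_summary(log_text):
    """Return the '[Taxa] max' value from summary log text, or None if absent."""
    match = TAXA_MAX.search(log_text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def check_total_taxa(intended_total, log_text):
    """Describe how the intended total compares with the largest alignment."""
    observed = max_taxa_from_summary(log_text)
    if observed is None:
        return "no [Taxa] max line found"
    if observed < intended_total:
        return (
            f"no alignment has all {intended_total} taxa "
            f"(largest has {observed})"
        )
    return f"largest alignment has {observed} taxa"


line = (
    "2020-09-16 11:09:36,700 - phyluce_align_get_align_summary_data "
    "- INFO - [Taxa] max: 63"
)
print(check_total_taxa(90, line))
```

A message like this does not say which total is right; it only flags that the cutoff will be computed against a number larger than any single alignment contains.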