AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

duplicate entries #55

Closed Tsingsjeen closed 2 years ago

Tsingsjeen commented 2 years ago

When I run GToTree with the following command:

GToTree -f ohr_bins/fasta_files.txt -a gtdb_reps/Bacteroidota-rep-accs.txt -H Bacteria -D -j 4 -o GToTree/

I get this error:

Error: gtdb_reps/Bacteroidota-rep-accs.txt has duplicate entries, check it out and provide unique accessions only. Exiting for now.

uniq -c gtdb_reps/Bacteroidota-rep-accs.txt

1 GCA_009930445.1
1 GCA_002748065.1
1 GCA_002869405.1
1 GCA_011525875.1
1 GCA_013151565.1
1 GCA_001516025.1
1 GCA_002746605.1
1 GCA_013151245.1
1 GCA_000279145.1
1 GCA_011367925.1
1 GCA_002786785.1
1 GCA_002790395.1
1 GCA_002839795.1
1 GCA_012719345.1
1 GCA_002898175.1
1 GCA_003233295.1
1 GCA_009773205.1
1 GCA_011049495.1
1 GCA_002698995.1
1 GCA_003155655.1
1 GCA_001803165.1
1 GCA_012026935.1
1 GCA_002839805.1
1 GCA_003599395.1
1 GCA_007280825.1
1 GCA_013112585.1
1 GCA_002403305.1
1 GCA_011620595.1
1 GCA_002383485.1
1 GCA_012514255.1
1 GCA_003250475.1
1 GCA_012513795.1
1 GCA_006227095.1
1 GCA_002402965.1
1 GCA_002409145.1
1 GCA_002325785.1
1 GCA_003245695.1
1 GCA_009881065.1
1 GCA_013138975.1
1 GCA_002869305.1
1 GCA_001768565.1
1 GCA_001768235.1
1 GCA_001769005.1
1 GCA_003446015.1
1 GCA_013166655.1
1 GCA_003445975.1
1 GCA_002403305.1
1 GCA_003141895.1
1 GCA_003141715.1
1 GCA_011042635.1
1 GCA_012517605.1
1 GCA_002319915.1
1 GCA_003141495.1
1 GCA_003141525.1
1 GCA_003152835.1
1 GCA_003162595.1
1 GCA_011620595.1
1 GCA_012729475.1
1 GCA_002383485.1
1 GCA_003157015.1
1 GCA_003518245.1
1 GCA_007132995.1
1 GCA_903844815.1
1 GCA_003141415.1
1 GCA_012514255.1
1 GCA_002424525.1
1 GCA_003501665.1
1 GCA_003520925.1
1 GCA_003521565.1
1 GCA_011049175.1
1 GCA_011374625.1
1 GCA_012519515.1
1 GCA_012522155.1
1 GCA_002070505.1
1 GCA_003142255.1
1 GCA_003250475.1
1 GCA_012838685.1
1 GCA_013152085.1
1 GCA_012513795.1
1 GCA_013314865.1
1 GCA_903824785.1
1 GCA_001768195.1
1 GCA_009877105.1
1 GCA_002319885.1
1 GCA_003157075.1
1 GCA_003155705.1
1 GCA_006227095.1
1 GCA_002328625.1
1 GCA_002402965.1
1 GCA_002409145.1
1 GCA_002428385.1
1 GCA_011050615.1
1 GCA_002325785.1
1 GCA_003157055.1
1 GCA_003245695.1
1 GCA_009881065.1
1 GCA_011367685.1

I checked the entries in gtdb_reps/Bacteroidota-rep-accs.txt with "uniq -c", as shown above, but it reports no duplicates. Do you know what happened?

Thanks!

AstrobioMike commented 2 years ago

Hey there :)

The uniq command only compares adjacent lines, so it won't spot duplicates unless the file is sorted first. Try something like this:

sort gtdb_reps/Bacteroidota-rep-accs.txt | uniq -c

or better yet

sort gtdb_reps/Bacteroidota-rep-accs.txt | uniq -c | sort -nk 1

as that should put the higher numbers clearly at the end.
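To see why the sort matters, here is a minimal sketch on a made-up three-line file (demo-accs.txt is a hypothetical name, not the file from this issue). It reuses GCA_002403305.1, which shows up twice in the listing above but never on neighboring lines. It also uses uniq -d, which prints only the lines that repeat, so there is no need to scan the counts by eye:

```shell
# Hypothetical demo file: one accession repeated, but not on adjacent lines
tmpdir=$(mktemp -d)
printf 'GCA_002403305.1\nGCA_009930445.1\nGCA_002403305.1\n' > "$tmpdir/demo-accs.txt"

# uniq on its own only compares neighboring lines, so it finds nothing here
unsorted_dups=$(uniq -d "$tmpdir/demo-accs.txt")

# sorting first puts identical lines next to each other;
# uniq -d then prints one copy of each line that occurs more than once
sorted_dups=$(sort "$tmpdir/demo-accs.txt" | uniq -d)

echo "without sort: '${unsorted_dups}'"
echo "with sort:    '${sorted_dups}'"
```

Running the second form on the real accessions file should list just the repeated accessions.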

To save it with no duplications into a new file, you can do this:

sort -u gtdb_reps/Bacteroidota-rep-accs.txt > gtdb_reps/Bacteroidota-rep-accs-no-dups.txt
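Note that sort -u also reorders the accessions. If you would rather keep the original line order, one common alternative is an awk one-liner that keeps only the first occurrence of each line. This is a sketch on a made-up demo file (the file names here are hypothetical), not something GToTree requires:

```shell
# Hypothetical demo file with one repeated accession
tmpdir=$(mktemp -d)
printf 'GCA_002403305.1\nGCA_009930445.1\nGCA_002403305.1\n' > "$tmpdir/demo-accs.txt"

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is true
# only for first occurrences; awk prints those and skips later repeats
awk '!seen[$0]++' "$tmpdir/demo-accs.txt" > "$tmpdir/demo-accs-no-dups.txt"

cat "$tmpdir/demo-accs-no-dups.txt"
```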

If that doesn't solve it, please attach the Bacteroidota-rep-accs.txt file so I can look into this, thanks!

Tsingsjeen commented 2 years ago

Thanks, it worked!