EBI-Metagenomics / genomes-catalogue-pipeline

MGnify genome analysis pipeline

Duplicated genomes in the UHGG v2 #32

Closed: fplazaonate closed this issue 1 year ago

fplazaonate commented 1 year ago

Hi,

I have noticed that some genomes in UHGG v2 are exactly the same.

Here is the list:

duplicated_genomes_metadata.txt

It would be great to fix this in future versions.

Best, Florian

mberacochea commented 1 year ago

Hi @fplaza, I'm not following. Could you please elaborate? Cheers

fplazaonate commented 1 year ago

Hi @mberacochea,

The exact same genomes (i.e., duplicates) are present several times (e.g., MGYG000002160, MGYG00003925, MGYG000180883).
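For anyone who wants to reproduce this kind of check, here is a minimal sketch of one way to spot exact duplicates among genome FASTA files. It is not necessarily how the attached list was produced. The function names `genome_digest` and `find_duplicates` are hypothetical, and the sketch assumes each genome is stored as one uncompressed FASTA file named after its accession. It hashes the contig sequences while ignoring headers, letter case, and contig order, so it will not catch reverse-complemented contigs or assemblies that were split differently.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def genome_digest(fasta_path):
    """Return a SHA-256 digest of a FASTA file's sequences.

    Headers, letter case, and contig order are ignored, so two files
    with identical contigs under different names get the same digest.
    """
    contigs = []
    current = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the contig collected so far.
                if current:
                    contigs.append("".join(current).upper())
                current = []
            elif line:
                current.append(line)
    if current:
        contigs.append("".join(current).upper())

    digest = hashlib.sha256()
    for contig in sorted(contigs):
        digest.update(contig.encode("ascii"))
        digest.update(b"\n")  # separator so contig boundaries matter
    return digest.hexdigest()


def find_duplicates(fasta_paths):
    """Group genome names (file stems) whose sequence digests are identical."""
    groups = defaultdict(list)
    for path in fasta_paths:
        groups[genome_digest(path)].append(Path(path).stem)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Calling `find_duplicates` on a list of FASTA paths returns one sorted list of names per group of identical genomes. Genomes that appear only once are left out.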

tgurbich commented 1 year ago

Hi Florian,

Thank you for pointing these out and sharing the list with us. These are in fact duplicates. We checked the source information and can see that some genomes are present twice because the original studies from which the genomes were obtained reused some of the samples. This is not unexpected in a large study like this, but we will look into removing the duplicates in future updates.

Thanks again! Tanya