BioinformaticsToolsmith / Identity


Identity on very large data #7

Open davidmaimoun opened 2 years ago

davidmaimoun commented 2 years ago

Hi,

Thanks for the tool!

Can I use it on very large data, like 300K-400K genomes (reads & assemblies)?

Best,

hani-girgis commented 2 years ago

Hello.

How long are these genomes? And how many reads are there?

Best,

Hani Z. Girgis, PhD

davidmaimoun commented 2 years ago

Approximately 5 Mb, and I'm thinking of focusing on assemblies first; i.e., if I have ~400k FASTAs, would it be possible to use Identity on them?

Thank you

hani-girgis commented 2 years ago

I believe so. How much RAM do you have available? The -b and -v parameters will help you control the memory consumption. I'd start with a -b of 1000 and a -v of 1000. Please keep me posted.
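For reference, a run with those settings might look like the sketch below; the input and output file names and the 0.9 threshold are placeholders (the threshold mirrors a later command in this thread), and -b 1000 / -v 1000 are just the suggested starting values for limiting memory use.

```shell
# Hedged sketch: genomes.fasta, the output name, and -t 0.9 are
# placeholders; -b 1000 and -v 1000 are the starting values suggested
# above for controlling memory consumption.
identity -d genomes.fasta -o genomes_identity.txt -t 0.9 -b 1000 -v 1000
```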

Best regards.

davidmaimoun commented 2 years ago

Thank you very much for your help,

I'll let you know when I get the results.

Kind Regards

LauraVP1994 commented 1 year ago

I also have quite a large dataset, and it has already been running for a week. I was wondering if there are ways to speed this up (I have already tried to reduce the number of sequences as much as possible)? Also, it would be great to be able to see how much it has done and how much is still left to do...

hani-girgis commented 1 year ago

Hi.

Would you please provide some information about your dataset? How many sequences? What is the average length? What parameters are you using to run Identity?

Best regards.

LauraVP1994 commented 1 year ago

I have multiple datasets on which I would like to use it, as I'm using this tool to select sequences to shrink down my dataset. I have, for example, a dataset of 553,123 Campylobacter coli sequences whose lengths range from 20,340 to 1,822,675 bp.

I'm currently using this command: srun --cpus-per-task 40 --mem=100G identity -d Campylobacter.coli_concatenated_filter.fasta -o Campylobacter.coli_identity.txt -t 0.9

hani-girgis commented 1 year ago

Hello there.

I would divide the sequences into groups based on length. Plotting the length distribution will help with finding the boundaries. Then I'd cluster each group separately. If you would like me to take a look at the plot, feel free to email it to me at hzgirgis at buffalo dot edu. Please keep me posted.
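The split described above can be sketched with a short awk script; a minimal, self-contained example under toy assumptions (the cut-offs of 10 and 20 bp and the group file names are placeholders — substitute your real FASTA and the boundaries read off your length plot):

```shell
# Toy input; substitute your actual FASTA file.
cat > input.fasta <<'EOF'
>seq1
ACGTACGT
>seq2
ACGTACGTAC
ACGTAC
>seq3
ACGTACGTACGT
ACGTACGTACGT
EOF

# Partition records into short/medium/long group files by total
# sequence length; lo and hi are the placeholder cut-offs.
awk -v lo=10 -v hi=20 '
  function emit(    n, out) {
    n = length(seq)
    out = (n < lo) ? "group_short.fasta" : ((n < hi) ? "group_medium.fasta" : "group_long.fasta")
    print hdr > out
    print seq > out
  }
  /^>/ { if (seq != "") emit(); hdr = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (seq != "") emit() }
' input.fasta
```

Each group file can then be clustered separately with Identity, as suggested above.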

Best regards.

LauraVP1994 commented 1 year ago

And ideally, how big should these groups be (in number of sequences and in range of lengths)?

hani-girgis commented 1 year ago

A group of short sequences can contain many sequences, whereas a group of long sequences should contain fewer. A plot of the length distribution would help with finding the cut-offs.
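The length distribution this advice relies on can be eyeballed (or plotted) from a sorted list of per-sequence lengths; a minimal sketch, where toy.fasta stands in for the real file:

```shell
# Toy input; substitute your actual FASTA file.
cat > toy.fasta <<'EOF'
>a
ACGT
>b
ACGTACGT
>c
AC
EOF

# Print one total sequence length per record, sorted ascending;
# feed lengths.txt to your plotting tool to choose the cut-offs.
awk '/^>/ { if (started) print len; len = 0; started = 1; next }
          { len += length($0) }
     END  { if (started) print len }' toy.fasta | sort -n > lengths.txt
```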