Fasta files for training/testing the data?

Mass23 / NOMIS_ENSEMBLE

Fasta files for training/testing the data? #1

Open jolespin opened 4 years ago

jolespin commented 4 years ago

Looking forward to collaborating on this! I just forked the repo, but I'm having trouble finding the fasta files used for training/testing the models.

Is the main goal to classify Eukaryotic contigs vs. everything else or Euk vs. Bac vs. Arc vs. Virus?

susheelbhanu commented 4 years ago

Hey Josh, the main goal is to go after the Eukaryotic contigs, but if we get the others in the process, i.e. bacteria and viruses, then there's no harm in it. So, we're working towards a generic approach.

The fasta files were a little too large to upload (approx. 600 GB), so we're trying an alternate approach: counting k-mers first and taking it from there. I will upload the files as soon as I can get out of the lab, and will keep you posted when we do. Sorry to keep you waiting, and thanks. In the meantime, have you tried: [https://github.com/soedinglab/metaeuk]

-Susheel

jolespin commented 4 years ago

Yes, it makes sense that the files would be huge. I've been making good use of clumpify to optimize the compression of large sequence files, but they would still be quite large. I have some ideas I want to try out regarding the preprocessing into k-mers that could help the classification. Do you maybe have an accession list? I could download the assemblies separately and try out a few preprocessing methods.
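To make the k-mer preprocessing idea concrete, here is a minimal sketch of turning contigs into normalized k-mer frequency vectors. It is only an illustration under stated assumptions: the function names (`read_fasta`, `kmer_counts`, `kmer_profile`) and the default `k=4` are hypothetical and are not taken from the NOMIS_ENSEMBLE repo, and windows containing ambiguous bases are simply skipped.

```python
from collections import Counter
from itertools import product


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def kmer_counts(seq, k=4):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_profile(seq, k=4):
    """Return k-mer frequencies in a fixed lexicographic order (length 4**k)."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[m] / total if total else 0.0 for m in kmers]
```

Passing an open file handle to `read_fasta` and calling `kmer_profile` on each sequence would give one fixed-length feature vector per contig, which is far smaller to share than the raw fasta files.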

It would be helpful if we could use a benchmark dataset to compare against other published tools. I've tried obtaining the accessions from the EukRep supplementary material, but there are only species names. I will try to download the sequences using those fuzzy name identifiers.

I haven't tried MetaEuk, but I'll read over it during the holidays. It looks like it is very new. From a skim through, it appears to rely heavily on annotations?

jolespin commented 4 years ago

Hope all is well. Just wanted to follow up on this. I will be fairly busy for the next week or so but should have some experimental time after the new year.

What would be the best way to move forward with this?

I have some ideas on how we could preprocess the data and organize the models in a particular way. Maybe we can discuss this over a conference call in the next few weeks? Also, if the plan is to develop a method for, at the very least, classifying Euks from Proks (hopefully Euks vs. Bac vs. Arch vs. Virus? vs. ?), it would be helpful to have the same benchmark dataset as EukRep. What are your thoughts on this? I just got some coral samples with quite a few eukaryotic contigs in them that I could test.