Open iqbal-lab opened 5 years ago
Do we actually use the genome size? I can't seem to find any reference to it in the code.
Only to estimate depth somewhere
This issue should be closed; Mykrobe already supports amplicon sequencing. There's a force option to skip species ID, and it works on amplicons, AFAIK.
I'll update properly on Tue
Okay, awesome.
@martinghunt do you know where in the code genome size impacts the depth estimation? I can't seem to see it anywhere...
Okay, I have realised targeted/amplicon sequencing isn't really supported.
I have been running mykrobe on some amplicon data where expected median depth should be in the thousands, but am getting estimated median depth of significantly less than that (single digits or low hundreds).
I'm trying to parse the current method for estimating depth, but it is pretty convoluted...
I'll keep digging away to see if I can find the best place to handle this.
My thought is to add an option like --amplicon that optionally takes a FASTA reference. If no FASTA reference is passed, we use the total size of the genes in mykrobe's panel in place of the genome size; otherwise we use the sum of sequence lengths in the provided FASTA.
Thoughts?
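A rough sketch of the FASTA half of that idea, in case it helps the discussion. This is not mykrobe code: `total_fasta_length` and `naive_depth` are hypothetical names, and the depth formula (total read bases divided by target size) is only an assumption about how a genome size would enter a depth estimate.

```python
import io


def total_fasta_length(handle):
    """Sum the sequence lengths of all records in an open FASTA handle."""
    total = 0
    for line in handle:
        line = line.strip()
        # Header lines start with ">"; everything else is sequence.
        if line and not line.startswith(">"):
            total += len(line)
    return total


def naive_depth(total_read_bases, target_size):
    """Assumed estimate: bases sequenced divided by the size of the target."""
    return total_read_bases / target_size


# Toy amplicon reference: two records of 12 and 4 bases.
fasta = io.StringIO(">amplicon1\nACGTACGT\nACGT\n>amplicon2\nGGCC\n")
target_size = total_fasta_length(fasta)
print(target_size)                       # 16
print(naive_depth(32_000, target_size))  # 2000.0
```

With a real file, the `io.StringIO` object would be replaced by `open(path)` on the FASTA passed to the hypothetical --amplicon option.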
Sorry @mbhall88 was busy today, will think and reply tomorrow
I've been playing with this locally, and it's tricky. It's extra tricky because the amplicon data I'm working with doesn't amplify entire genes, so lots of the variants have a depth of 0 or 1 and the rest have something like 100,000. This ends up skewing the median depth calculation.
Any other thoughts on how to better estimate the depth? Plotting kmer_counts, cutting out the counts in the first peak, and then taking the median of the rest sounds reasonable, but I'm not sure how to do this.
This has been sitting around as a possibility for some time; we should just do it. I propose