Add support for targeted sequencing

Mykrobe-tools / mykrobe

Antibiotic resistance prediction in minutes

MIT License

Add support for targeted sequencing #28

Open iqbal-lab opened 5 years ago

iqbal-lab commented 5 years ago

This has been sitting around as a possibility for some time; we should just do it. I propose:

- Chuck out species id
- Update genome size to be the sum of the lengths of the genes in our targets (allow user specification)
- Otherwise should run as normal
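
To illustrate why the genome size matters for targeted data, here is a minimal sketch. It assumes depth is approximated as total sequenced bases divided by the size of the region the reads come from, which is not necessarily how mykrobe computes it. The function name and the numbers are hypothetical.

```python
def expected_depth(total_bases: int, target_size: int) -> float:
    """Mean depth expected if reads are spread evenly over target_size bases."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return total_bases / target_size


# Hypothetical run: 1,000,000 reads of 150 bp each.
total_bases = 1_000_000 * 150

# Illustrative whole-genome size of about 4.4 Mb.
whole_genome = expected_depth(total_bases, 4_400_000)

# Hypothetical sum of target gene lengths (20 kb).
targets_only = expected_depth(total_bases, 20_000)

print(f"whole genome: {whole_genome:.1f}x, targets only: {targets_only:.1f}x")
```

With the same reads, the expected depth differs by a factor of several hundred depending on which size is used, so a whole-genome size would badly underestimate depth on targeted data.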

mbhall88 commented 2 years ago

Do we actually use the genome size? I can't seem to find any reference to it in the code.

iqbal-lab commented 2 years ago

Only to estimate depth somewhere

iqbal-lab commented 2 years ago

This issue should be closed; Mykrobe already supports amplicon sequencing. There's a force option to skip species id, and it works on amplicons afaik.

iqbal-lab commented 2 years ago

I'll update properly on Tue

mbhall88 commented 2 years ago

Okay, awesome.

@martinghunt do you know where in the code genome size impacts the depth estimation? I can't seem to see it anywhere...

mbhall88 commented 1 year ago

Okay, have realised targeted/amplicon sequencing isn't really supported.

I have been running mykrobe on some amplicon data where the expected median depth should be in the thousands, but I am getting an estimated median depth significantly lower than that (single digits or low hundreds).

I'm trying to parse the current method for estimating depth, but it is pretty convoluted...

I'll keep digging away to see if I can find the best place to handle this.

My thoughts are to have an option like --amplicon which optionally takes a fasta reference. If no fasta reference is passed, we use the size of the genes in mykrobe's panel, otherwise we use the sum of sequence lengths in the provided fasta.
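
A rough sketch of how such an option could behave, using argparse. This is not mykrobe's actual CLI: the parser, `fasta_total_length` and `target_size` are hypothetical names, and the panel gene lengths are passed in as a plain list for illustration.

```python
import argparse
from typing import Iterable, Optional, Sequence, Union


def fasta_total_length(lines: Iterable[str]) -> int:
    """Sum the sequence lengths in FASTA-formatted lines, skipping headers."""
    total = 0
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="amplicon-sketch")
    # nargs="?" makes the value optional:
    #   flag absent        -> None
    #   --amplicon         -> True (the const)
    #   --amplicon ref.fa  -> "ref.fa"
    parser.add_argument("--amplicon", nargs="?", const=True, default=None,
                        metavar="FASTA")
    return parser


def target_size(amplicon: Union[None, bool, str],
                panel_gene_lengths: Sequence[int]) -> Optional[int]:
    """Return the size to use for depth estimation, or None for normal mode."""
    if amplicon is None:
        return None  # flag not given: run as normal
    if amplicon is True:
        return sum(panel_gene_lengths)  # no FASTA: fall back to the panel genes
    with open(amplicon) as handle:
        return fasta_total_length(handle)


args = build_parser().parse_args(["--amplicon"])
print(target_size(args.amplicon, [1200, 3500, 900]))
```

One thing to watch with an optional-value flag: if it is placed directly before a positional argument, argparse will consume that positional as the FASTA path.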

Thoughts?

iqbal-lab commented 1 year ago

Sorry @mbhall88 was busy today, will think and reply tomorrow

mbhall88 commented 1 year ago

I've been playing with this locally. It's tricky. It's also extra tricky because the amplicon data I'm working with isn't amplifying entire genes. So lots of the variants have depth of like 0 or 1 and the rest have like 100,000. This ends up skewing the median depth calculation....

Any other thoughts on how to better estimate the depth? I mean, a plot of kmer_counts and cutting out the counts of the first peak and then taking the median of the rest sounds reasonable, but not sure how to do this.
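
For what it's worth, a minimal sketch of that idea in plain Python, assuming the per-variant (or per-k-mer) counts are available as a list of integers. The function names are hypothetical, and the peak detection is deliberately naive: it works on the raw histogram with no smoothing, so a noisy histogram would need binning or smoothing first.

```python
from collections import Counter
from statistics import median
from typing import Sequence


def first_peak_cutoff(counts: Sequence[int]) -> int:
    """Return the largest count value that still belongs to the first peak."""
    if not counts:
        raise ValueError("counts must not be empty")
    hist = Counter(counts)  # missing count values read as frequency 0
    c, top = min(hist), max(hist)
    # Climb to the top of the first peak.
    while c < top and hist[c + 1] >= hist[c]:
        c += 1
    # Walk down its right-hand slope until an empty bin or a rise.
    while c < top and 0 < hist[c + 1] <= hist[c]:
        c += 1
    return c


def depth_excluding_first_peak(counts: Sequence[int]) -> float:
    """Median of the counts above the first peak; plain median if none remain."""
    cutoff = first_peak_cutoff(counts)
    rest = [c for c in counts if c > cutoff]
    return median(rest) if rest else median(counts)


# Hypothetical amplicon-like data: mostly 0/1 with a few very deep sites.
counts = [0, 0, 0, 1, 1, 0, 1, 98_000, 100_000, 102_000]
print(median(counts), depth_excluding_first_peak(counts))
```

On this toy data the plain median is dragged down to the low-count peak, while the median of what is left after the cut sits among the deep sites. The fallback to the plain median covers the case where there is only one peak.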