Using multiple fastq files and getting info on cluster sizes

Saskia-Oosterbroek commented 2 years ago

Hi Saskia,

I ended up ditching the Google Colab option: each time you use it you get an empty environment, so conda and decona have to be installed again for every run, which is not very practical. What I ended up doing is setting up a virtual machine in Google Cloud, and it works great. You have to pay for it, but I hope my institution can foot the bill.

I have a couple of questions / suggestions regarding the output - is this a good forum for them? Happy to repost wherever it is better.

I initially get a folder for each fastq file. I think this was designed with one fastq per barcode/sample in mind. In my case there was only one sample, but with the original Nanopore output structure of one fastq per 4k reads. In each folder I get the initial, filtered and fasta-ed sequence files, the representatives of each cluster and the sequences that ended up in each cluster. I could use the info from cluster_representatives.fast.clstr to parse how many sequences went into each cluster. Do you have a script to do that? I guess something that returns a table in the shape of

cluster	barcode	#seqs
106-24	barcode01	345
713-2	barcode01	1650

and another fasta with

>106-24
ATGCGAGAA
>713-2
ACGTG

If not, no worries - I'll try to make one and push it your way
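A minimal sketch of such a parser, assuming the .clstr file follows the usual CD-HIT layout (a `>Cluster N` header line followed by one line per member sequence). The function names and the example text are made up for illustration, and the cluster label here is the CD-HIT cluster index, not a Decona cluster name.

```python
def count_cluster_members(clstr_text):
    """Return {cluster_index: number_of_sequences} from CD-HIT style .clstr text."""
    counts = {}
    current = None
    for line in clstr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = line.split()[1]
            counts[current] = 0
        elif current is not None:
            # Every other non-empty line is one member sequence of the cluster
            counts[current] += 1
    return counts


def to_table(counts, barcode):
    """Format the counts as a tab-separated table: cluster, barcode, #seqs."""
    rows = ["cluster\tbarcode\t#seqs"]
    for cluster, n in counts.items():
        rows.append(f"{cluster}\t{barcode}\t{n}")
    return "\n".join(rows)


# Hypothetical example input, not real Decona output
example = """>Cluster 0
0\t1500nt, >read_a... *
1\t1490nt, >read_b... at +/98.50%
>Cluster 1
0\t1400nt, >read_c... *
"""

counts = count_cluster_members(example)
print(to_table(counts, "barcode01"))
```

For a real file, the text could be read with `open(path).read()` and passed to `count_cluster_members`.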

In each folder there is a subfolder named multi-seq with Racon, Medaka and recluster. Are they in order? Is Racon the first round of polishing and Medaka the second?

~I couldn't find a txt file with the original command the user ran - that would be useful for reproducibility and whatnot~ [Yes I found it under the report, duh! ]

Again, many thanks and congratulations for this great tool!

Originally posted by @ramongallego in https://github.com/Saskia-Oosterbroek/decona/issues/8#issuecomment-938001314

Saskia-Oosterbroek commented 2 years ago

Hi Ramon,

I am happy to hear it's working for you! It's not very convenient to have output per fastq file if those do not represent individual samples! I assume you have data from either one large run on one sample or demultiplexed your data yourself with Guppy. You should be able to fix this using the -f (folder structure) flag. It will treat fastq files in one folder as one sample.

I do have a script that gives you some more info about the cluster sizes: the -i flag writes a file called "sizereport***.txt". I must admit its output is not in a very convenient place. Definitely something to adjust still! For now it appears per barcode or sample in their respective folders, together with the files you mentioned (initial, filtered and fasta-ed sequence files). It looks like this:

Size	No. seq	No. clstr
1	11	11
2-4	16	7
5-9	34	6
10-19	22	2
20-49	107	3
50-99	0	0
100-299	206	1
300-499	821	2
500-99999	4295	1
Total	5512	33	0
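To make the columns concrete: each row counts the clusters whose size falls in that range ("No. clstr") and the reads those clusters contain ("No. seq"). Below is a rough sketch of how such a table can be computed from a list of cluster sizes. It is not the script behind the -i flag; the bin edges are simply copied from the report above and `size_report` is a hypothetical name.

```python
# Size ranges copied from the example report (inclusive bounds)
BINS = [(1, 1), (2, 4), (5, 9), (10, 19), (20, 49),
        (50, 99), (100, 299), (300, 499), (500, 99999)]


def size_report(cluster_sizes):
    """Return rows of (size range, number of sequences, number of clusters)."""
    rows = []
    for lo, hi in BINS:
        in_bin = [s for s in cluster_sizes if lo <= s <= hi]
        label = str(lo) if lo == hi else f"{lo}-{hi}"
        rows.append((label, sum(in_bin), len(in_bin)))
    # Totals cover only clusters that fell into one of the bins above
    total_seqs = sum(r[1] for r in rows)
    total_clusters = sum(r[2] for r in rows)
    rows.append(("Total", total_seqs, total_clusters))
    return rows


# Hypothetical cluster sizes for illustration
for label, n_seq, n_clstr in size_report([1, 1, 3, 206]):
    print(f"{label}\t{n_seq}\t{n_clstr}")
```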

But I think your question is not actually about the distribution but about the actual cluster sizes! Thanks for that question, I should write that somewhere more clearly :) The cluster names actually include their sizes, which is probably not logical to anyone but me! The name is built up like this:

> polished-206-7.fasta
>"it's polished with Racon" - "contains 206 reads" - "cluster number is 7".fasta

I hope that makes sense.
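As a sketch of how that naming scheme can be turned into an abundance table, the snippet below splits a file name of the form `polished-206-7.fasta` into its parts. It assumes every cluster file follows exactly the `prefix-reads-cluster` pattern described above; the function names and the example file list are hypothetical.

```python
import re
from pathlib import Path

# Assumed pattern: <step>-<number of reads>-<cluster number>
NAME_RE = re.compile(r"^(?P<step>[A-Za-z]+)-(?P<reads>\d+)-(?P<cluster>\d+)$")


def parse_cluster_name(filename):
    """Split e.g. 'polished-206-7.fasta' into step, read count and cluster number."""
    stem = Path(filename).stem  # drops the '.fasta' extension
    match = NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected cluster file name: {filename}")
    return {
        "step": match["step"],
        "reads": int(match["reads"]),
        "cluster": int(match["cluster"]),
    }


def abundance_rows(barcode, filenames):
    """Build (cluster label, barcode, #seqs) rows from cluster file names."""
    rows = []
    for name in filenames:
        info = parse_cluster_name(name)
        label = f"{info['reads']}-{info['cluster']}"
        rows.append((label, barcode, info["reads"]))
    return rows


# Hypothetical file names for illustration
print(parse_cluster_name("polished-206-7.fasta"))
print(abundance_rows("barcode01", ["polished-106-24.fasta", "polished-713-2.fasta"]))
```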

I have a new version of Decona on the way (I have been saying this for way too long already, but I promise it's coming), and that will have a much more convenient fasta output of all data combined, looking something like:

>barcode01_106-24
ATGCGAGAA
>barcode01_713-2
ACGTG
>barcode02_175-47
ATGCTGA
>barcode02_5422-56
CGTAGA

I hope these things help, let me know if it works for you :) I'm happy with the suggestions! Saskia

ramongallego commented 2 years ago

Thanks Saskia! It does make sense. Now I can easily make the abundance table with that info. I saw the numbers in the sequence names and didn't think about what they could be. Duh!

I look forward to the new release! More on it when you have it ready, but I have one question: because clustering happens within each sample, how do you make sure that the same sequence in two barcodes converges on the same consensus and gets the same name? In your example, the consensus from cluster 24 in barcode01 might be the same as cluster 56 in barcode02.
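One simple way to approach this after the fact, sketched below, is to group consensus sequences from all barcodes by exact sequence identity and give each group a shared name. This is only an illustration, not something Decona does as far as this thread shows: it assumes exact (case-insensitive) matches, so two consensus sequences that differ by a single base or are reverse complements would stay separate and would need a second clustering step across barcodes. The names `merge_identical` and the `seq_` prefix are made up.

```python
import hashlib
from collections import defaultdict


def merge_identical(records):
    """Group (barcode, cluster_name, sequence) records by identical sequence.

    Returns {shared_id: {"sequence": str, "members": [(barcode, cluster_name), ...]}}.
    """
    groups = defaultdict(list)
    for barcode, name, seq in records:
        groups[seq.upper()].append((barcode, name))

    merged = {}
    for seq, members in groups.items():
        # The shared name is derived from the sequence itself, so it is the
        # same no matter which barcode the sequence was found in
        shared_id = "seq_" + hashlib.sha1(seq.encode()).hexdigest()[:10]
        merged[shared_id] = {"sequence": seq, "members": members}
    return merged


# Hypothetical records for illustration
records = [
    ("barcode01", "106-24", "ATGCGAGAA"),
    ("barcode02", "5422-56", "atgcgagaa"),
    ("barcode01", "713-2", "ACGTG"),
]
for shared_id, group in merge_identical(records).items():
    print(shared_id, group["sequence"], group["members"])
```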

Also, one thing to add to the feature wish-list is a PCR primer removal option. On my run today I found not only that the primers were still there, but also that small chunks of the sequencing adapters remained in the sequences.

Good luck and thanks for your reply and work!

Saskia-Oosterbroek / decona

Using multiple fastq files and getting info on cluster sizes #15