c-zhou / oatk

An organelle de novo genome assembly toolkit
MIT License

oatk -c parameter #12

Open smoothyly opened 1 month ago

smoothyly commented 1 month ago

Hello Chenxi (@c-zhou), when I run the command syncasm -k 1001 -c 100 -t 48 -o green green.fastq

I get an error (screenshot 6b87038b96b0c9a89580983860337bf). It hints that my max_c is 5, but when I try -c 5, it cannot assemble. How can I solve this? Is there something wrong with my data? My data is raw HiFi data (green.fastq).

c-zhou commented 1 month ago

Hello @smoothyly,

This max_c is different from the -c parameter. It is the maximum coverage at which a kmer is treated as an error and subjected to error correction.

I guess when you ran it with -c 100, you did not get any sequence assembled, and with -c 5, you got too much extra stuff. Is that correct? If that is the case, you could try some values in between. Do you know your data coverage? Have you tried something like GenomeScope to get a genome profile? It would be useful to see a kmer plot.

I also want to point out that setting a big -c only applies when you want to assemble organelles. If you want to build a syncmer graph for your nuclear genome, -c 5 is big enough.
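One way to try values in between is a small sweep over -c. The sketch below only prints the commands (a dry run) so they can be reviewed before anything is executed. The -k 1001, -t 48 and green.fastq values are copied from the command at the top of this thread; the candidate values 25 and 50 and the green_cN output prefixes are illustrative assumptions, not recommendations.

```shell
#!/usr/bin/env bash
# Dry run: print one syncasm command per candidate -c value.
# -k and -t are copied from the command earlier in this thread.
# The output prefix pattern green_c<N> is an assumption made for illustration.
sweep_cmds() {
  local reads="$1"; shift
  local c
  for c in "$@"; do
    printf 'syncasm -k 1001 -c %s -t 48 -o green_c%s %s\n' "$c" "$c" "$reads"
  done
}

# Candidate thresholds between the two values already tried (5 and 100).
sweep_cmds green.fastq 5 25 50 100
```

Piping the printed lines to sh would run them one after another; comparing the sizes of the resulting graphs gives a rough idea of where nuclear sequence starts to drop out.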

Best, Chenxi

smoothyly commented 1 month ago

Actually, I just want to assemble organelles, and I am not sure how big -c should be. I ran GenomeScope to get a genome profile, as you suggested. (image: linear_plot)

c-zhou commented 1 month ago

Your data coverage is high. You can set a very large -c, such as 200 or even 300.

Can you please show me the error you got with -c 5 and -c 100 or attach the log file from syncasm? They should both be fine for running syncasm - although with -c 5 you will get a lot of nuclear sequences assembled. The above picture you showed is not an error. It is a message from syncasm for the error correction step. But, weirdly, no error blocks were found.

If you want the organelle genomes only, you can use the oatk command instead of running the pipeline step by step. It does syncasm first, then hmm_annotation, and finally pathfinder.

Best, Chenxi

smoothyly commented 1 month ago

I set a large -c of 200 and it ran, but it didn't produce the chloroplast file. (image)

So is this right? Here is my log file: [Uploading log.txt…]()

c-zhou commented 1 month ago

Yeah, this is a valid output. All have been finished successfully.

You have a lot of sequences assembled, as indicated by the size of oatk_200.utg.final.gfa. There should be some nuclear sequences in there. This is OK.

It generated a mitochondrial assembly of ~85Kb. It did not generate a chloroplast assembly, meaning no sequences have been annotated as chloroplast sequences. You can check oatk_200.annot_pltd.txt to see if there are any good hits for chloroplast genes.
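A quick first check on the annotation file is simply to count how many hit lines it contains. This is a minimal sketch that assumes the file is a text table in which comment lines start with '#' (as in HMMER tabular output); that layout is an assumption here, not something confirmed in this thread, and the helper name count_hits is made up for the example.

```shell
#!/usr/bin/env bash
# Count non-comment, non-empty lines in an annotation table.
# Assumption: comment/header lines start with '#'.
count_hits() {
  awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# File name taken from the thread; skipped quietly if it is not present.
if [ -f oatk_200.annot_pltd.txt ]; then
  echo "chloroplast hit lines: $(count_hits oatk_200.annot_pltd.txt)"
fi
```

A count of zero would be consistent with no chloroplast assembly being produced; a non-zero count means the individual hits are worth reading.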

We can usually find chloroplast sequences in data with such a high depth of coverage. Is it possible that the tissue you used for sequencing does not have chloroplasts at all? Another possible reason is that the gene profile database is not very good for your species. Which database did you use? The embryophyta one? Does it match your species?

Chenxi

smoothyly commented 1 month ago

Actually, my species is an alga, but I could not find an algae database, so I used the embryophyta database. If I want to assemble the chloroplast genome, do you have a better suggestion?

Best wishes, Zhanwu

c-zhou commented 1 month ago

The embryophyta database is probably not very good for algae. We have never tried it.

One thing you can do is to have a look at the assembly graph oatk_200.utg.final.gfa to see if the chloroplast sequences are there. For land plants, we usually see much higher coverage of chloroplast sequences than the nuclear genome, so the subgraph for the chloroplast genome stands out clearly. I am not sure if this is the case for algae, but it is definitely worth checking.
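Before opening the graph in a viewer, the highest-coverage segments can also be listed from the command line. The sketch below assumes that each S line in the GFA carries a per-segment coverage tag; the default tag name SC:f is an assumption and should be replaced with whatever tag the file actually uses. The function name top_segments is hypothetical.

```shell
#!/usr/bin/env bash
# List the N segments with the highest coverage from a GFA file.
# Arguments: GFA path, coverage tag prefix (ASSUMED default: SC:f), N (default 10).
# Output columns: segment name, length, coverage.
top_segments() {
  local gfa="$1" tag="${2:-SC:f}" n="${3:-10}"
  awk -F'\t' -v tag="$tag:" '
    $1 == "S" {
      len = length($3); cov = "NA"
      for (i = 4; i <= NF; i++) {
        if (index($i, tag) == 1)     cov = substr($i, length(tag) + 1)
        if (index($i, "LN:i:") == 1) len = substr($i, 6)  # prefer the LN tag if present
      }
      if (cov != "NA") print $2 "\t" len "\t" cov
    }' "$gfa" | LC_ALL=C sort -t$'\t' -k3,3gr | head -n "$n"
}

# File name taken from the thread; skipped quietly if it is not present.
if [ -f oatk_200.utg.final.gfa ]; then
  top_segments oatk_200.utg.final.gfa
fi
```

If a handful of segments sit far above the rest in coverage, those are the natural candidates to look at first.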

Bandage is a nice tool to visualise the assembly graph. I am happy to help have a look if you upload a bandage screenshot of the graph.

Chenxi

smoothyly commented 4 weeks ago

Thanks for your advice. Finally, I would like to ask: if I get the chloroplast sequences of algae, how do I make them into a fam file?

c-zhou commented 4 weeks ago

You can check this repo https://github.com/c-zhou/OatkDB

Chenxi

smoothyly commented 4 weeks ago

Actually, I tried to build the file with the command ./oatkdb -j 4 -t 8 -c 11 -o alage_pltd_v20230911 574566 chloroplast and got this error:

cat: TEMP_574566_PLTD_v20240607/rawGBFile.gb.T*: No such file or directory
cp: cannot stat ‘TEMP_574566_PLTD_v20240607/DB.fam’: No such file or directory

Error: File existence/permissions problem in trying to open HMM file alage_pltd_v20230911.fam.
HMM file alage_pltd_v20230911.fam not found (nor an .h3m binary of it)

Any help would be appreciated! Best wishes,

Zhanwu

c-zhou commented 4 weeks ago

That means OatkDB found no green algae chloroplast genomes at the NCBI repository. Do you know if there are any?

The query performed for searching the nucleotide database at the NCBI repo was: txid574566 [Organism] AND chloroplast [Filter].
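The same query can be checked independently before running OatkDB, using the NCBI E-utilities esearch endpoint. The sketch below only builds the URL; the helper name esearch_url is made up for illustration, and using db=nuccore with rettype=count to get a record count is this example's choice, not necessarily what OatkDB does internally.

```shell
#!/usr/bin/env bash
# Build an E-utilities esearch URL for "txid<ID> [Organism] AND <filter> [Filter]".
# Spaces are encoded as '+' and square brackets as %5B / %5D.
esearch_url() {
  local taxid="$1" filter="$2"
  local term="txid${taxid}+%5BOrganism%5D+AND+${filter}+%5BFilter%5D"
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&rettype=count&term=%s\n' "$term"
}

esearch_url 574566 chloroplast

# To query NCBI (needs network access), fetch the URL and read the <Count> element:
#   curl -s "$(esearch_url 574566 chloroplast)"
```

A count of 0 for a taxid would match the "No such file or directory" errors above, since there would be nothing to download.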

Chenxi

smoothyly commented 4 weeks ago

I've used a couple of taxids and none of them seem to work.

I got the same error each time.

Even if I use the taxid in the example I get the same error.

  Example: oatkdb -j 4 -t 8 -c 11 -o angiosperms_pltd_v20230911 3398 chloroplast

I also logged in to NCBI and found this (image). I wonder whether this affects it?

c-zhou commented 4 weeks ago

I do not think that is the problem. I can still run it from my end.

I did a search with txid574566 [Organism] AND chloroplast [Filter] but found nothing in the nucleotide database. I found sequences with txid3398 [Organism] AND chloroplast [Filter].

c-zhou commented 4 weeks ago

I found two algae datasets here. The embryophyta database seems fine for algae.

The chloroplast coverage varies. One sample has about 60-fold higher coverage than the nuclear genome and the other one has only about 4-fold higher coverage. We should probably use a low coverage threshold (-c) to avoid missing the chloroplast sequences. For your dataset, -c 100 might be OK.
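The reasoning about the threshold can be made concrete with a little arithmetic. In this sketch the nuclear coverage of 50 is a purely hypothetical number chosen for illustration (the real value would come from the kmer profile); the 4-fold and 60-fold ratios are the two cases mentioned above. It treats -c as a cut-off that should sit above the nuclear coverage and below the chloroplast coverage.

```shell
#!/usr/bin/env bash
# Expected chloroplast coverage = nuclear coverage x fold enrichment (integers only).
chloroplast_cov() {
  echo $(( $1 * $2 ))
}

# Succeeds if threshold c lies strictly between nuclear and chloroplast coverage.
c_separates() {
  local c="$1" nuc="$2" fold="$3"
  [ "$c" -gt "$nuc" ] && [ "$c" -lt "$(( nuc * fold ))" ]
}

nuc=50   # HYPOTHETICAL nuclear coverage, for illustration only
for fold in 4 60; do
  for c in 100 300; do
    if c_separates "$c" "$nuc" "$fold"; then verdict="keeps"; else verdict="may drop"; fi
    echo "fold=$fold chloroplast_cov=$(chloroplast_cov "$nuc" "$fold") -c $c: $verdict chloroplast kmers"
  done
done
```

With a low enrichment, a very large -c can end up above the chloroplast coverage itself, which is why a lower threshold is the safer choice when the ratio is unknown.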

I also changed the code to increase the upper bound of the empirical chloroplast size. It was 200Kb and is now 250Kb. I got a 240Kb chloroplast assembly for my algae sample. I have also assembled some angiosperm chloroplasts larger than 200Kb, so the change seems worthwhile.

You can download the source code, compile it, and rerun oatk with -c 100 -p embryophyta_pltd.fam -m embryophyta_mito.fam.
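Put together, the rerun might look roughly like the following. This is a dry run that only prints the commands. The repository URL and the -c, -p and -m options come from this thread; building with make, the oatk/oatk binary path, and the -k, -t and -o options (mirrored from the syncasm command at the top) are assumptions to verify against the repository's README before running anything.

```shell
#!/usr/bin/env bash
# Dry run: print the steps to rebuild oatk and rerun it with a given -c.
# ASSUMPTIONS: 'make' builds the tool, the binary ends up at oatk/oatk,
# and oatk accepts -k, -t and -o like syncasm does.
rerun_cmds() {
  local reads="$1" c="$2"
  cat <<EOF
git clone https://github.com/c-zhou/oatk
make -C oatk
oatk/oatk -k 1001 -c $c -t 48 -p embryophyta_pltd.fam -m embryophyta_mito.fam -o oatk_$c $reads
EOF
}

rerun_cmds green.fastq 100
```

The .fam files are expected in the working directory here; adjust the paths if they are stored elsewhere.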

Chenxi