Open smoothyly opened 1 month ago
Hello @smoothyly,
This max_c is different from the -c parameter. It is the maximum coverage at which a k-mer is still treated as an error and subjected to error correction.
I guess when you ran it with -c 100, you did not get any sequence assembled, and with -c 5, you got too much extra stuff. Is that correct? If that is the case, you could try some values in between. Do you know your data coverage? Have you tried something like GenomeScope to get a genome profile? It would be useful to see a k-mer plot.
I also want to point out that setting a big -c only applies when you want to assemble organelles. If you want to build a syncmer graph for your nuclear genome, -c 5 is big enough.
Best, Chenxi
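A minimal sketch of trying "some values in between", as suggested above. It only prints the commands (a dry run) so that nothing is launched by accident. The flags are the ones used elsewhere in this thread; the candidate values 10, 25 and 50 and the output prefixes are placeholders chosen for illustration, not recommendations.

```shell
# Dry-run sketch: print one syncasm command per candidate -c value.
# The flags (-k, -c, -t, -o) are those shown in this thread; the candidate
# values and the green_c<N> output prefixes are hypothetical placeholders.
sweep_syncasm_c() {
    reads="$1"
    shift
    for c in "$@"; do
        # Use a separate output prefix per run so results are not overwritten.
        echo "syncasm -k 1001 -c ${c} -t 48 -o green_c${c} ${reads}"
    done
}

sweep_syncasm_c green.fastq 10 25 50
```

Each printed line can be pasted into a terminal once the value looks sensible for the data.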
Actually, I just want to assemble organelles, and I am not sure how big -c should be.
I ran GenomeScope to get a genome profile, as you suggested.
Your data coverage is high. You can set a very large -c, such as 200 or even 300.
Can you please show me the error you got with -c 5 and -c 100, or attach the log file from syncasm? They should both be fine for running syncasm, although with -c 5 you will get a lot of nuclear sequences assembled. The picture you showed above is not an error. It is a message from syncasm for the error correction step. But, weirdly, no error blocks were found.
If you want the organelle genomes only, you can use the oatk command instead of running the pipeline step by step. It runs syncasm first, then hmm_annotation, and finally pathfinder.
Best, Chenxi
I tried setting a large value, -c 200, and it ran, but it didn't produce the chloroplast file.
Is this right? Here is my log file: [Uploading log.txt…]()
Yes, this is valid output. All steps finished successfully.
You have a lot of sequences assembled, as indicated by the size of oatk_200.utg.final.gfa. There should be some nuclear sequences in there. This is OK.
It generated a mitochondrial assembly of ~85 Kb. It did not generate a chloroplast assembly, meaning no sequences were annotated as chloroplast sequences. You can check oatk_200.annot_pltd.txt to see if there are any good hits for chloroplast genes.
With data at such a high depth of coverage, we can usually find chloroplast sequences. Is it possible that the tissue you used for sequencing does not have chloroplasts at all? Another possible reason is that the gene profile database is not very good for your species. Which database did you use? The embryophyta one? Does it match your species?
Chenxi
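A quick first look at the annotation file could be as simple as counting its data lines. The sketch below assumes the file is a plain-text table in which comment lines start with `#` (as in HMMER-style tabular output); that layout is an assumption and is not confirmed in this thread. The function name `count_hits` is made up for illustration.

```shell
# Count data lines (lines that are neither comments nor blank) in an annotation table.
# ASSUMPTION: comment lines start with '#'. The file layout is not confirmed here.
count_hits() {
    # -c prints the number of selected lines; -v selects lines matching neither pattern.
    # grep exits non-zero when the count is 0, so "|| true" keeps the function quiet.
    grep -c -v -e '^#' -e '^[[:space:]]*$' "$1" || true
}

# Only run against the real file when it exists in the current directory.
if [ -f oatk_200.annot_pltd.txt ]; then
    count_hits oatk_200.annot_pltd.txt
fi
```

A count of zero would be consistent with "no sequences annotated as chloroplast"; a non-zero count means the individual hits are worth reading.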
Actually, my species is an alga, but I could not find an algae database, so I used the embryophyta database. If I want to assemble the chloroplast genome, do you have a better suggestion?
Best wishes, Zhanwu
The embryophyta database is probably not very good for algae. We have never tried it.
One thing you can do is have a look at the assembly graph oatk_200.utg.final.gfa to see if the chloroplast sequences are there. For land plants, we usually see much higher coverage of chloroplast sequences than the nuclear genome, so the subgraph for the chloroplast genome stands out clearly. I am not sure if this is the case for algae, but it is definitely worth checking.
Bandage is a nice tool for visualising the assembly graph. I am happy to have a look if you upload a Bandage screenshot of the graph.
Chenxi
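Before opening the graph in Bandage, the high-coverage segments can also be listed from the command line. This sketch assumes each GFA segment line carries a per-segment coverage tag of the form `SC:f:<value>`. The tag name is an assumption, so check a few `S` lines of your own file and pass the right tag as the third argument. The function name and the threshold of 500 are placeholders.

```shell
# List GFA segments whose coverage tag is at or above a threshold.
# ASSUMPTION: segments carry a coverage tag such as "SC:f:<value>";
# pass another tag name as the third argument if your GFA differs.
high_cov_segments() {
    gfa="$1"
    min_cov="$2"
    tag="${3:-SC}"
    awk -F'\t' -v min="$min_cov" -v tag="$tag" '
        $1 == "S" {
            # Optional fields start at column 4 and look like TAG:TYPE:VALUE.
            for (i = 4; i <= NF; i++) {
                split($i, f, ":")
                if (f[1] == tag && f[3] + 0 >= min) {
                    print $2 "\t" f[3]
                }
            }
        }' "$gfa"
}

# Only run against the real graph when it exists; 500 is a placeholder threshold.
if [ -f oatk_200.utg.final.gfa ]; then
    high_cov_segments oatk_200.utg.final.gfa 500
fi
```

Segments that stand far above the bulk of the graph are the natural candidates to inspect in Bandage.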
Thanks for your advice. Finally, I would like to ask: if I get the chloroplast sequences of algae, how do I turn them into a .fam file?
You can check this repo https://github.com/c-zhou/OatkDB
Chenxi
Actually, I tried to build the file with this command:
./oatkdb -j 4 -t 8 -c 11 -o alage_pltd_v20230911 574566 chloroplast
And I got this error:
cat: TEMP_574566_PLTD_v20240607/rawGBFile.gb.T*: No such file or directory
cp: cannot stat ‘TEMP_574566_PLTD_v20240607/DB.fam’: No such file or directory
Error: File existence/permissions problem in trying to open HMM file alage_pltd_v20230911.fam.
HMM file alage_pltd_v20230911.fam not found (nor an .h3m binary of it)
Any help would be appreciated! Best wishes,
Zhanwu
That means OatkDB found no green algae chloroplast genomes at the NCBI repository. Do you know if there are any?
The query used to search the nucleotide database at NCBI was: txid574566 [Organism] AND chloroplast [Filter].
Chenxi
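The same query can be checked independently of OatkDB through NCBI's public E-utilities `esearch` endpoint, which can return just the number of matching records. The sketch below only builds the URL. The helper name `esearch_count_url` is made up, and it assumes that `nuccore` is the database OatkDB searches, which is inferred from "nucleotide database" above rather than confirmed.

```shell
# Build an E-utilities esearch URL that returns only the hit count for
# "txid<taxid>[Organism] AND <filter>[Filter]" in the nucleotide database.
# ASSUMPTION: db=nuccore corresponds to the "nucleotide database" mentioned above.
esearch_count_url() {
    taxid="$1"
    filter="$2"
    # Brackets are percent-encoded (%5B, %5D) and spaces are written as '+'.
    term="txid${taxid}%5BOrganism%5D+AND+${filter}%5BFilter%5D"
    echo "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&rettype=count&term=${term}"
}

esearch_count_url 574566 chloroplast
# To actually query NCBI (needs network access and curl):
#   curl -s "$(esearch_count_url 574566 chloroplast)"
```

If the returned count is zero for a taxid, there is nothing for OatkDB to download, which would be consistent with the missing rawGBFile and DB.fam files in the error above.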
I've used a couple of taxids and none of them seems to work.
I got the same error each time.
Even if I use the taxid in the example I get the same error.
Example: oatkdb -j 4 -t 8 -c 11 -o angiosperms_pltd_v20230911 3398 chloroplast
And I logged in to NCBI and found
I wonder if this affects it?
I do not think that is the problem. I can still run it from my end.
I did a search with txid574566 [Organism] AND chloroplast [Filter] but found nothing in the nucleotide database. I found sequences with txid3398 [Organism] AND chloroplast [Filter].
I found two algae datasets here. The embryophyta database seems fine for algae.
The chloroplast coverage varies. One sample has about 60-fold higher coverage than the nuclear genome and the other one has only about 4-fold higher coverage. We should probably use a low coverage threshold (-c) to avoid missing the chloroplast sequences. For your dataset, -c 100 might be OK.
I also changed the code to increase the upper bound of the empirical chloroplast size from 200 Kb to 250 Kb. I got a 240 Kb chloroplast assembly for my algae sample. I have also assembled some angiosperm chloroplasts larger than 200 Kb, so the change is probably a good idea.
You can download the source code, compile it, and rerun oatk with -c 100 -p embryophyta_pltd.fam -m embryophyta_mito.fam.
Chenxi
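Put together, the suggested rerun might look like the command printed by the sketch below. The -c, -p and -m values are the ones given above. The -o output prefix is assumed to work for oatk the way it does for syncasm in this thread, and the prefix `oatk_100`, the reads file name and the helper name are placeholders. The sketch prints the command rather than running it.

```shell
# Dry-run sketch: assemble the oatk command line suggested above.
# -c, -p and -m come from this thread; -o is ASSUMED to set the output
# prefix as it does for syncasm. Prefix and reads file are placeholders.
build_oatk_cmd() {
    reads="$1"
    cov="$2"
    prefix="$3"
    echo "oatk -c ${cov} -p embryophyta_pltd.fam -m embryophyta_mito.fam -o ${prefix} ${reads}"
}

build_oatk_cmd green.fastq 100 oatk_100
```

Check the printed line against the tool's own usage message before running it, since any option not quoted in this thread is an assumption.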
Hello Chenxi (@c-zhou), when I run the command
syncasm -k 1001 -c 100 -t 48 -o green green.fastq
I got an error:
It indicates that my max_c is 5, but when I try -c 5, it cannot assemble. How can I solve this?
Is there something wrong with my data? My data is raw HiFi data (green.fastq).