Open adriaaula opened 1 year ago
Regarding completeness, the analysis seems to favour working with the transcriptomes.
The info about the genomes is the following:
`Genome_Id final names` total_length Estimated_length Nombre_de_genes ANVIO_completion ANVIO_redund…¹ BUSCO…² BUSCO…³
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TARA_ARC_108_MAG_00262 26410259 134746219. 11086 13.2 9.64 19.6 1.2
2 TARA_MED_95_MAG_00475 14445854 167975047. 6149 3.61 0 8.6 0.4
3 TARA_SOC_28_MAG_00069 21522871 124409659. 10292 13.2 2.41 17.3 0
Both the completeness and the number of genes are quite low. If we compare this to the transcriptomes:
file format type num_seqs sum_len min_len avg_len max_len
EP00618_Florenciella_parvula.fasta FASTA Protein 40,380 10,010,661 30 247.9 11,348
EP00619_Florenciella_sp_RCC1007.fasta FASTA Protein 22,072 3,185,622 30 144.3 2,751
EP00620_Florenciella_sp_RCC1587.fasta FASTA Protein 24,752 6,095,897 30 246.3 7,121
EP00621_Florenciella_sp_RCC1693.fasta FASTA Protein 16,681 3,537,849 30 212.1 3,445
These numbers suggest that the transcriptomes are more complete than the MAGs, so it makes more sense to work with them. The MAGs contain between 14 and 26 M nucleotides in total, whereas the transcriptomes contain between about 9.6 M and 30 M nucleotides of coding sequence (protein sum_len × 3), hypothetically at close to full completeness.
We could set the MAGs aside for now and return to them if they turn out to be useful in later analyses.
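As a rough check of the size comparison, here is a minimal sketch that converts the protein `sum_len` values from the table above into coding nucleotides (three nucleotides per amino acid) and prints them next to the MAG `total_length` values. The dictionary and function names are hypothetical, and the comparison is only approximate because the MAG totals also include non-coding sequence.

```python
# Protein sum_len (amino acids) copied from the transcriptome table above.
protein_sum_len = {
    "EP00618_Florenciella_parvula": 10_010_661,
    "EP00619_Florenciella_sp_RCC1007": 3_185_622,
    "EP00620_Florenciella_sp_RCC1587": 6_095_897,
    "EP00621_Florenciella_sp_RCC1693": 3_537_849,
}

# MAG total_length (nucleotides) copied from the genome table above.
mag_total_length = {
    "TARA_ARC_108_MAG_00262": 26_410_259,
    "TARA_MED_95_MAG_00475": 14_445_854,
    "TARA_SOC_28_MAG_00069": 21_522_871,
}


def coding_nt(aa_length: int) -> int:
    """Coding nucleotides implied by a protein length (3 nt per amino acid)."""
    return aa_length * 3


for name, aa in protein_sum_len.items():
    print(f"{name}: {coding_nt(aa) / 1e6:.1f} M coding nt")

for name, nt in mag_total_length.items():
    print(f"{name}: {nt / 1e6:.1f} M nt (coding + non-coding)")
```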
I think the MED MAG would be great to include as an outgroup.
For ecotype analyses, I would try to use the transcriptomes for the Florenciella species, since there will be higher resolution to distinguish among ecotypes. And they represent a group in the tree, which you would lose if you exclude them.
I compared the TOPAZ, SMAG and EukProt genetic material for Florenciella.
To do this I computed signatures with sourmash. The signatures are k-mer hashes (k = 31 for nucleotides, k = 10 for amino acids), which make comparisons easier. I obtained the following distribution for the genomes:
Given that these genomes come from the same data, we would expect the MAGs from Delmont to cluster together with the TOPAZ MAGs, but this doesn't seem to be the case in our analysis. Possible explanations are that 1) each method recovered different regions of the genome, or 2) one of the approaches is much more contaminated, which breaks the similarities.
The minimum ANI similarity is 0.558.
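To make the k-mer comparison more concrete, here is a simplified sketch of the idea behind signature similarity: break each sequence into k-mers, then compare the resulting sets. This is not sourmash's implementation, which hashes and subsamples the k-mers and handles reverse complements. The function names are hypothetical, and the ANI formula is a simple point estimate that assumes independent mutations along the sequence.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers in a sequence (no hashing, no subsampling)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared k-mers divided by all distinct k-mers in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set[str], b: set[str]) -> float:
    """Fraction of the k-mers of `a` that are also present in `b`."""
    return len(a & b) / len(a) if a else 0.0


def ani_from_containment(c: float, k: int) -> float:
    """Point estimate of ANI: if each base matches with probability ANI,
    a whole k-mer survives with probability ANI**k, so ANI ~ c**(1/k)."""
    return c ** (1.0 / k)


# Toy sequences with a small k; real analyses would use k = 31 for nucleotides.
a = kmers("ACGTACGTAC", k=4)
b = kmers("ACGTACGTTT", k=4)
print(jaccard(a, b), containment(a, b), containment(b, a))
```

Containment is asymmetric, which matters when one genome is far less complete than the other: a small, incomplete MAG can be fully contained in a larger one while the Jaccard similarity stays low.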
When we compare the transcriptomes the values change slightly:
This result, which includes a known true positive (the transcriptomes from EukProt), shows us that 1) the TOPAZ MAGs (green and red clusters) are either contaminated or not Florenciella, and that 2) the MED MAG for Florenciella (number 5 in the plot) is the most divergent from the transcriptomes.
We can therefore work with the Delmont SMAGs, given that they are representative of the known diversity for this group. Questions:
That's it.