Selection of MAGs and transcriptomes to work with

adriaaula commented 1 year ago

I compared the Florenciella genetic material available in TOPAZ, the SMAGs and EukProt.

The procedure was the following:

Parse the taxonomy files of the SMAGs and TOPAZ and extract every MAG that is classified down to the genus level and assigned to Florenciella.
Obtain the available transcriptomes directly from EukProt; in this case these are only the predicted proteins.
Calculate a signature for each FASTA file using sourmash. The signatures are hashes of k-mers (k = 31 for nucleotides, k = 10 for amino acids), and they make the comparisons easier.
Compare these sketches/signatures using ANI estimates.
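To make the last two steps concrete, here is a minimal pure-Python sketch of the idea behind them. It is not what sourmash runs internally: it uses exact k-mer sets instead of MinHash sketches, skips reverse-complement handling, and converts Jaccard similarity to an ANI estimate with the Mash distance formula as a stand-in. The sequences and the small k are hypothetical toy values.

```python
import math


def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping k-mers in seq (no reverse-complement handling)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ani_from_jaccard(j: float, k: int) -> float:
    """Mash-style point estimate: ANI ~ 1 + ln(2j / (1 + j)) / k, clamped to [0, 1]."""
    if j <= 0.0:
        return 0.0
    return max(0.0, min(1.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k))


# Toy sequences (hypothetical, not taken from any of the genomes in this thread).
seq_a = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCATCGATCGGATTACA"
seq_b = "ATGGCGTACGTTAGCCTAGGTTCCGATTACGGCATCGATCGGATTACA"  # one substitution
k = 11  # small k so the toy example shares k-mers; the analysis above used k = 31

j = jaccard(kmers(seq_a, k), kmers(seq_b, k))
print(f"Jaccard = {j:.3f}, estimated ANI = {ani_from_jaccard(j, k):.3f}")
```

A single substitution removes every k-mer that overlaps it, which is why Jaccard drops much faster than the underlying nucleotide identity and why a log-scaled conversion is applied.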

I obtained the following distribution for the genomes:

0   data/genomic_data/genomes/TARA_ARC_108_MAG_00262.fa
1   data/genomic_data/genomes/TARA_MED_95_MAG_00475.fa
2   data/genomic_data/genomes/TARA_SOC_28_MAG_00069.fa
3   data/genomic_data/genomes/TOPAZ_MSS1_E028/TOPAZ_MSS1_E028.fna.gz
4   data/genomic_data/genomes/TOPAZ_MSS1_E030/TOPAZ_MSS1_E030.fna.gz
5   data/genomic_data/genomes/TOPAZ_SAS1_E003/TOPAZ_SAS1_E003.fna.gz
6   data/genomic_data/genomes/TOPAZ_SAS1_E006/TOPAZ_SAS1_E006.fna.gz
7   data/genomic_data/genomes/TOPAZ_SPS1_E066/TOPAZ_SPS1_E066.fna.gz

compar_genomes dist matrix

Given that these genomes come from the same data, we would expect the MAGs from Delmont to cluster together with the TOPAZ MAGs, but this doesn't seem to be the case in our analysis. Possible explanations are that 1) each method recovered different regions of the genome, or 2) one of the approaches produces much more contaminated MAGs, and this breaks the similarities.

The minimum ANI similarity is 0.558.
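For reference, a minimum like this can be read off a pairwise similarity matrix by scanning the upper triangle, which also tells you which pair is responsible. The sketch below assumes the matrix is available as a symmetric list of lists with a matching label list; the labels and values are hypothetical toy numbers, not the ones from the plot.

```python
def min_offdiagonal(labels, matrix):
    """Return (similarity, label_i, label_j) for the least similar pair."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):  # upper triangle only, skip the diagonal
            if best is None or matrix[i][j] < best[0]:
                best = (matrix[i][j], labels[i], labels[j])
    return best


# Hypothetical toy matrix (symmetric, 1.0 on the diagonal).
labels = ["genome_A", "genome_B", "genome_C"]
matrix = [
    [1.00, 0.92, 0.60],
    [0.92, 1.00, 0.75],
    [0.60, 0.75, 1.00],
]

lowest = min_offdiagonal(labels, matrix)
print(lowest)
```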

When we compare the transcriptomes the values change slightly:

0   data/genomic_data/transcriptomes/EP00618_Florenciella_parvula.fasta
1   data/genomic_data/transcriptomes/EP00619_Florenciella_sp_RCC1007.fasta
2   data/genomic_data/transcriptomes/EP00620_Florenciella_sp_RCC1587.fasta
3   data/genomic_data/transcriptomes/EP00621_Florenciella_sp_RCC1693.fasta
4   data/genomic_data/genomes/TARA_ARC_108_MAG_00262.gmove.pep.faa
5   data/genomic_data/genomes/TARA_MED_95_MAG_00475.gmove.pep.faa
6   data/genomic_data/genomes/TARA_SOC_28_MAG_00069.gmove.pep.faa
7   data/genomic_data/genomes/TOPAZ_MSS1_E028/TOPAZ_MSS1_E028.faa.gz
8   data/genomic_data/genomes/TOPAZ_MSS1_E030/TOPAZ_MSS1_E030.faa.gz
9   data/genomic_data/genomes/TOPAZ_SAS1_E003/TOPAZ_SAS1_E003.faa.gz
10  data/genomic_data/genomes/TOPAZ_SAS1_E006/TOPAZ_SAS1_E006.faa.gz
11  data/genomic_data/genomes/TOPAZ_SPS1_E066/TOPAZ_SPS1_E066.faa.gz

compar_trans dist matrix

This result, which includes the known true positives (the transcriptomes from EukProt), shows that 1) the TOPAZ MAGs (green and red clusters) are either contaminated or not Florenciella, and 2) the MED MAG for Florenciella (number 5 in the plot) is the one that differs most from the transcriptomes.
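One way to put a number on "how far is each MAG from the known references" is to average each MAG's similarity to the reference rows of the matrix and sort the result. This is only a sketch of that idea: the labels, the values and the choice of the mean as a summary are assumptions for illustration, not the analysis behind the plot.

```python
def mean_similarity_to_refs(labels, matrix, ref_idx):
    """Rank non-reference entries by mean similarity to the references, lowest first."""
    refs = set(ref_idx)
    scores = {}
    for i, name in enumerate(labels):
        if i in refs:
            continue  # do not score the references against themselves
        scores[name] = sum(matrix[i][r] for r in ref_idx) / len(ref_idx)
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical toy matrix: two references and two MAGs.
labels = ["ref_1", "ref_2", "mag_A", "mag_B"]
matrix = [
    [1.00, 0.95, 0.90, 0.70],
    [0.95, 1.00, 0.88, 0.66],
    [0.90, 0.88, 1.00, 0.65],
    [0.70, 0.66, 0.65, 1.00],
]

ranking = mean_similarity_to_refs(labels, matrix, ref_idx=[0, 1])
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

The first entry of the ranking is the candidate that sits farthest from the references, which is the role the MED MAG plays in the comparison above.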

We can therefore work with the Delmont SMAGs, given that they are representative of the known diversity for this group. Questions:

Should I include the MED MAG?
Should I work with the transcriptomes too? Map the reads to them and so on? Given that we want to check ecotypes and similar approaches, my go-to decision would be to focus specifically on the MAGs.

That's it.

adriaaula commented 1 year ago

Regarding completeness, the analysis seems to favour working with the transcriptomes.

The info about the genomes is the following:

  `Genome_Id final names` total_length Estimated_length Nombre_de_genes ANVIO_completion ANVIO_redund…¹ BUSCO…² BUSCO…³
  <chr>                          <dbl>            <dbl>           <dbl>            <dbl>          <dbl>   <dbl>   <dbl>
1 TARA_ARC_108_MAG_00262      26410259       134746219.           11086            13.2            9.64    19.6     1.2
2 TARA_MED_95_MAG_00475       14445854       167975047.            6149             3.61           0        8.6     0.4
3 TARA_SOC_28_MAG_00069       21522871       124409659.           10292            13.2            2.41    17.3     0

Both the completeness and the number of genes are quite low. Compare this to the transcriptomes:

file                                   format  type     num_seqs     sum_len  min_len  avg_len  max_len
EP00618_Florenciella_parvula.fasta     FASTA   Protein    40,380  10,010,661       30    247.9   11,348
EP00619_Florenciella_sp_RCC1007.fasta  FASTA   Protein    22,072   3,185,622       30    144.3    2,751
EP00620_Florenciella_sp_RCC1587.fasta  FASTA   Protein    24,752   6,095,897       30    246.3    7,121
EP00621_Florenciella_sp_RCC1693.fasta  FASTA   Protein    16,681   3,537,849       30    212.1    3,445

These numbers suggest that the transcriptomes are more complete than the MAGs, and therefore it makes more sense to work with them. The genomes contain between 14 and 26 M nucleotides, whereas the predicted proteins of the transcriptomes total between 3.2 M and 10.0 M amino acids, i.e. roughly 9.6 M to 30 M nucleotides of coding information (hypothetically with overall completeness).

We could set the MAGs aside and return to them if they turn out to be useful in later analyses.
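Because the MAG lengths are in nucleotides and the transcriptome `sum_len` values are in amino acids, the two tables are only comparable after a unit conversion. The sketch below multiplies the protein totals by 3 (ignoring stop codons and untranslated regions, so it is an approximation) and also divides each MAG's `total_length` by its `Estimated_length`. All numbers are copied from the two tables above; the variable names are mine.

```python
# sum_len of the predicted proteins, in amino acids (from the transcriptome table).
protein_sum_len_aa = {
    "EP00618_Florenciella_parvula": 10_010_661,
    "EP00619_Florenciella_sp_RCC1007": 3_185_622,
    "EP00620_Florenciella_sp_RCC1587": 6_095_897,
    "EP00621_Florenciella_sp_RCC1693": 3_537_849,
}

# (total_length, Estimated_length) in nucleotides (from the genome table).
mag_lengths_nt = {
    "TARA_ARC_108_MAG_00262": (26_410_259, 134_746_219),
    "TARA_MED_95_MAG_00475": (14_445_854, 167_975_047),
    "TARA_SOC_28_MAG_00069": (21_522_871, 124_409_659),
}

# Approximate coding nucleotides: 3 nt per amino acid.
coding_nt = {name: 3 * aa for name, aa in protein_sum_len_aa.items()}

# Fraction of the estimated genome length that each MAG actually contains.
recovered_fraction = {
    name: total / estimated for name, (total, estimated) in mag_lengths_nt.items()
}

for name, nt in coding_nt.items():
    print(f"{name}\t{nt / 1e6:.1f} M nt of coding sequence")
for name, frac in recovered_fraction.items():
    print(f"{name}\t{frac:.1%} of estimated length")
```

Note that the MAG lengths include non-coding sequence while the transcriptome figures are coding sequence only, so the comparison still favours the transcriptomes more than the raw totals suggest.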

djrichter commented 1 year ago

I think the MED MAG would be great to include as an outgroup.

For ecotype analyses, I would try to use the transcriptomes for the Florenciella species, since there will be higher resolution to distinguish among ecotypes. And they represent a group in the tree, which you would lose if you exclude them.
