anuradhawick / MetaBCC-LR

Reference-free Binning of Metagenomics Long Reads using Coverage and Composition
https://doi.org/10.1093/bioinformatics/btaa441
MIT License

Ground truth format #6

Closed prashantpandey closed 3 years ago

prashantpandey commented 3 years ago

What should be the format for the ground truth file?

Currently, I am using

 readname  <tab> bin_id
 readname  <tab> bin_id
 readname  <tab> bin_id
....

But I am getting an error TypeError: unhashable type: 'numpy.ndarray' at line 308 in mbcclr_utils/binner_core.py.

prashantpandey commented 3 years ago

Bump!! I am currently running MetaBCC-LR on long reads simulated using CAMISIM, but I am seeing precision/recall below 10%. I want to make sure I am not missing something, so I would like to supply the ground truth to your system and check whether it computes the same precision/recall.

anuradhawick commented 3 years ago

Hi, apologies for the delayed reply. The ground truth file is just the bin ID on each line; because of this, every read must have a ground truth bin. If you have reads without a ground truth, feel free to label them as "unclassified" or an arbitrary name. Let me know if this helps.
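For example, if you already have a readname<tab>bin_id file, a minimal sketch like the one below could strip the read names (this assumes the file is already in the same order as the reads given to MetaBCC-LR; the file names are hypothetical):

with open("ground_truth.tsv") as tsv, open("ground_truth.txt", "w") as out:
    # keep only the bin ID column, one per line, preserving the read order
    for line in tsv:
        read_name, bin_id = line.rstrip("\n").split("\t")
        out.write(bin_id + "\n")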

Also try to change the sensitivity and see if it improves.

Best regards,
Anuradha

prashantpandey commented 3 years ago

Hi Anuradha,

I ran MetaBCC-LR on a sample dataset with 3 species and also specified the ground truth. This is what I got.

2020-12-09 14:50:42,417 - INFO - Total reads              692
2020-12-09 14:50:42,417 - INFO - Precision              0.301
2020-12-09 14:50:42,417 - INFO - Recall                 0.987
2020-12-09 14:50:42,418 - INFO - ARI                    0.029

There are a couple of issues here:

  1. The number of reads is about two orders of magnitude off. Is this a reporting error, or did MetaBCC-LR not process the rest of the reads?
  2. Why is the recall higher than the precision?

Thanks, Prashant

anuradhawick commented 3 years ago

Hi,

The binning tool performs a sampling step to handle the large number of reads in long-read samples. In all our tests, we used datasets with more than 10,000-100,000 reads. The evaluation is only carried out on the sampled set of reads. In this case, it seems the tool sampled ~700 reads, which is too low for our algorithm.

You can manually increase the number of sampled reads to, say, ~10,000 or ~15,000 (for this you have to simulate more than 10,000 or 15,000 reads). As we did not expect ground truth to be supplied to our tool, we did not implement evaluation on the final output; the built-in evaluation only covers the sampled set of reads. Note, however, that the performance metrics of the sampled reads and the final output are quite similar. We had a separate simple Python script to evaluate the complete set of reads; I have attached that function in this comment. You have to use it with the final bins file along with the ground truths (the required files should be in the output directory, which is produced after filtering out reads shorter than 1000 bp).

import logging

from tabulate import tabulate
from sklearn.metrics import adjusted_rand_score

logger = logging.getLogger(__name__)


def evaluate_clusters(clusters, species):
    # clusters: dict mapping bin ID -> object exposing getLabels(), where
    #           getLabels() returns the ground-truth label of each read in that bin
    # species:  list of all ground-truth labels
    all_clusters = list(clusters.keys())
    all_species = species

    # index maps for species (rows) and clusters (columns) of the confusion matrix
    s_map = {}
    c_map = {}

    labels = []
    truth = []

    for s in all_species:
        s_map[s] = len(s_map)

    for c in all_clusters:
        c_map[c] = len(c_map)

    # confusion matrix: mat[species][cluster] = number of reads
    mat = [[0 for c in range(len(c_map))] for s in range(len(s_map))]

    for c in all_clusters:
        cc = clusters[c]
        for s in cc.getLabels():
            mat[s_map[s]][c_map[c]] += 1
            labels.append(c_map[c])
            truth.append(s_map[s])

    rMax = 0
    cMax = 0
    tot = 0

    # tot: total reads; rMax: sum of the dominant count in each species row
    for x in mat:
        tot += sum(x)
        rMax += max(x)

    # transpose so that rows correspond to clusters
    matT = [[mat[j][i] for j in range(len(mat))] for i in range(len(mat[0]))]

    # cMax: sum of the dominant count in each cluster column
    for x in matT:
        cMax += max(x)

    # prepend the species name to each row for display
    for n, s in enumerate(all_species):
        mat[n] = [s] + mat[n]

    logger.info("\n" + tabulate(mat, headers=[""] + [c for c in all_clusters], tablefmt="fancy_grid"))

    logger.info(f"Total reads  {tot:15}")
    logger.info(f"Precision    {rMax / tot:15.3f}")
    logger.info(f"Recall       {cMax / tot:15.3f}")
    logger.info(f"ARI          {adjusted_rand_score(truth, labels):15.3f}")

In my observations across many experiments, recall gets very high when the number of bins is underestimated. If you're happy to share your simulation parameters and species, I am happy to run them. However, that will likely have to wait until after the long Christmas holidays, as I will have limited resource access during the holidays.

Let me know if this helps.
Best regards,
Anuradha