CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
480 stars 189 forks source link

Abstract core algorithm in to C library #96

Open IanSudbery opened 7 years ago

IanSudbery commented 7 years ago

In order to allow wider adoption of the UMI-tools deduplication methods beyond simple position + UMI based deduping we would like to create a C library that implemented the core algos.

As much as possible we would like to the abstracted C calls to take a list of UMIs and their counts, and return either a list of deduplicated UMIs or grouped UMIs according to the specified method choice.

IanSudbery commented 7 years ago

The branch code_algo_in_c contains the beginnings of this. Commit 32c1ed6 contains first attempt at the abstraction problem.

Currently I've tried to remove any refference to BAM objects from the class ReadClusterer, which this commit renames UMIClusterer. Read specifics are now delt with by the new ReadClusterer, which basically calls UMIClusterer and applys the results to the current bundle.

Currently the 'UMIClusterer' if of the format:

final_umis, umi_counts, topologies, nodes = UMIClusterer(umis, counts, threshold, stats, further_stats, deduplicate)

final_umis contains a list of umis if deduplicate is true, or returns a list of umi groups if it is False.

Two problems with this so far

1) It seems uncesscary to pass umis to the call as the umis should be contained as the keys of the counts directionary, but missing it out and starting the the call with umis = counts.keys() results in complex changes to the order of output making it difficult to verify the outputs are equivalent in the case of grouping.

2) Currently stats computations are done inside UMIClusterer. This seems wrong are it is not part of the core algo. Also the stats computations currently use lots of python specific concepts that make make transcoding to C difficult. It would also make changing the way stats are computed in the future difficult. Calculation of the stats is depended on details of the implementation. To export the stats calculations we could need to expose parts of the intermediate calculations to the user, and futher, the user would also need to understand details of the methods to compute their own stats, which kind of goes against the spirit of abstracting the algos into a separate library for easy reuse.

I feel like the ideal call signature would be: final_umis = UMIClusterer(counts, threshold, deduplicate)

But this would loose too much infomation to be able to calculate stats.

@TomSmithCGAT opinions?

TomSmithCGAT commented 7 years ago

Hi @IanSudbery. This is a great idea. I'm very much on board!

Regarding stats, as I see it the issue only relates to collecting the 'further stats' which provide information on the topology of the networks. Essentially this was added to dedup to generate statistics for the publication and I don't see it as a core functionality of dedup. It was implemented in a 'least hassle' manner without any real thought to maintaining the code long term. Personally, I'm happy to to drop the topology stats entirely. However, if we did want to keep this option, I think we could retain it quite more easily as shown below.

As you point out, ReadClusterer now returns different objects depending on the UMI-tools command: list of umis (dedup) or groups of umis (group). This would be easier if the ReadClusterer was doing less of the work. Right now it's making the adjacency list, building the network and then either getting the groups or identifying the best read. One of the issues I had implementing the group command was making sure the groups were exactly the same as the theoretical groups from dedup although these are never determined. Perhaps we should always explicitly determine the groups first regardless and only then identify the top read? I'm not sure whether the following would make the C transcoding easier or harder though?

networks = UMINetworker(counts, threshold) # C function

# If we retain further stats
if options.further_stats():
    for network in networks:
        topology_stats.update(summarise_toplogy(network)) # Undetermined method to update stats

for network in networks:
    groups = getGroups(network, options.method) # C function

    if not deduplicate: # if running group command write out and move on
        write_groups... # write out groups
        continue

    best_read = GetBestRead(group) # else write out the best read
        write_read... # write read
IanSudbery commented 7 years ago

Hi Tom,

I'd be happy to drop the further stats. I don't know if anyone actaully uses them.

As for the seperating out the steps ... I thought about this, but I would like users to be able to use the library without really understanding what is going on under the hood. So I would like everything to happen in one step ideally.

Of course, I can just about imagine a scenario where one command is run to select the groups and another to choose the best read from the group. Trouble is the two are connected by the method, and so you have to understand the methods(?) to use it properly.

TomSmithCGAT commented 7 years ago

Yeah, let's drop it then. This will of course cause a minor backwards compatibility issue if anyone's using the option in a pipeline etc but I doubt anyone is using it either. We can always return to this if required.

Each group contains a set of unique reads/UMIs (for a given location). If the C library function took UMI counts as input and returned ordered groups as output, the top read would just be group[0]. I think this would be a reasonable output and would ensure the outputs of dedup and group are compatible (e.g one could derive the dedup output from the group output).

IanSudbery commented 7 years ago

Is that the case irrespective of the method?

TomSmithCGAT commented 7 years ago

Yup. Any one read should only be assigned to a single group and the read with the highest counts is always considered the 'best' read.

IanSudbery commented 7 years ago

Not true. Your group directional assumes that nodes in a cluster are ordered by count, but _get_connected_components_adjacency does not guarantee this.

This may also be why the recursive search and the breadth first search return different things.

TomSmithCGAT commented 7 years ago

Umm...that's a concern! I'll have a look into this later. Thanks!

IanSudbery commented 7 years ago

If I insert a sort statement in _group_directional:

 def _group_directional(self, clusters, adj_list, counts):
        ''' return groups for directional method'''

        observed = set()
        groups = []

        for cluster in clusters:
            if len(cluster) == 1:
                groups.append(list(cluster))
                observed.update(cluster)
            else:
                cluster = sorted(cluster, key = lambda x: counts[x], # <--- NEW SORT HERE
                                 reverse = True) 
                # need to remove any node which has already been observed
                temp_cluster = []
                for node in cluster:
                    if node not in observed:
                        temp_cluster.append(node)
                        observed.add(node)
                groups.append(temp_cluster)

        return groups

then the first UMI in each group matches the output for deduplicating.

IanSudbery commented 7 years ago

And, with the above change, enabling the recursive search now gives the same result.

TomSmithCGAT commented 7 years ago

Yes, just came to the same conclusion myself. Thanks for spotting this

TomSmithCGAT commented 7 years ago

Ah, brilliant!

IanSudbery commented 7 years ago

Do you want me to patch the sort into the master?

TomSmithCGAT commented 7 years ago

Yes please! Checking back on the previous issue regarding the recursive search (#69), I was getting a different result for dedup with directional with a high depth sample (not covered by tests) so I suspect this is still the case given the code for dedup isn't affected by this patch?

IanSudbery commented 7 years ago

Okay, lastest push (see f48e99d) contains the fully abstracted code as we discussed. I've also reimplemented group for adjacency to make it quicker as it initially doubled the run time. This has the benefit of making the code more compact.

I've tested against the nosetests, and a couple of full size files (directional and adjacency). Things are the same up to a sort order.

I've also created a simple unit test of the new abstracted code, based on figure 1 of the paper for development purposes.

TomSmithCGAT commented 7 years ago

Great. So we're ready to try out a C library for the networks.

Do you want me to go through and remove the further stats option and documentation - This will fail currently if the --further-stats options is supplied though I guess that's not a problem on this branch?

TomSmithCGAT commented 7 years ago

The code_algos_in_c branch has been merged into master (#117, #118)