PlantProteomes / SeqComparison

A project for comparing plant proteome sequences
Apache License 2.0

Let's also add the Unique column #14

Open edeutsch opened 2 years ago

edeutsch commented 2 years ago

Until now we've ignored the "Unique" column because it is a bit hard. But now that you have more skill, it seems attainable.

The Unique column represents how many sequences in a data source are found only in that one source and not in any other source. We can still count a sequence once if it appears twice in a single source but not in any other sources.
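To make the definition concrete, here is a minimal sketch on toy in-memory data. The function name `count_unique`, the source names, and the short sequences are all made up for illustration and are not taken from the existing matrix program.

```python
from collections import defaultdict

def count_unique(sources):
    """Count, per source, the sequences found in that source and no other.

    `sources` is a hypothetical dict mapping a source name to an iterable
    of sequence strings.
    """
    # Map each sequence to the set of sources it appears in. Using a set
    # means a sequence repeated within one source is recorded only once.
    seen_in = defaultdict(set)
    for name, sequences in sources.items():
        for seq in sequences:
            seen_in[seq].add(name)

    # A sequence is unique when exactly one source contains it.
    unique = {name: 0 for name in sources}
    for names in seen_in.values():
        if len(names) == 1:
            (only_source,) = names
            unique[only_source] += 1
    return unique

# Toy data: "MKT" appears twice, but only in SourceA, so it counts once.
example = {
    "SourceA": ["MKT", "MKT", "GGA"],
    "SourceB": ["GGA", "PLV"],
}
print(count_unique(example))  # {'SourceA': 1, 'SourceB': 1}
```

Here "GGA" is shared by both sources, so it counts toward neither Unique value.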

This is harder because you can't compute the answer until you've gone through all the files. There seem to be two broad approaches:

1) Assemble a dict of how many data sources each sequence is seen in as you pass through all the files, and then at the end compute the uniqueness and back-fill the table (matrix).

2) Or, perhaps a bit easier conceptually but requiring more computation, take a read through all the files first, before you start the main part of your matrix program. I.e., read through all the files compiling which files contain which sequences and collate uniqueness, and then proceed with your existing matrix program with this information already available in a dictionary. This requires two read-throughs of the files.
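The second approach (a pre-pass over all the files) might look roughly like the sketch below. It assumes the inputs are FASTA files, which this issue does not actually state, and the names `read_sequences`, `collate_sources`, and `is_unique_to` are hypothetical helpers rather than parts of the existing program.

```python
from collections import defaultdict

def read_sequences(lines):
    """Yield one sequence string per record from FASTA-style lines.

    `lines` can be an open file handle or any iterable of strings.
    The FASTA format is an assumption made for this sketch.
    """
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header line starts a new record; emit the previous one.
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)

def collate_sources(paths):
    """First read-through: map each sequence to the set of files containing it."""
    sources_by_sequence = defaultdict(set)
    for path in paths:
        with open(path) as handle:
            for seq in read_sequences(handle):
                sources_by_sequence[seq].add(path)
    return sources_by_sequence

def is_unique_to(seq, path, sources_by_sequence):
    """True when `seq` was seen in `path` and in no other file."""
    return sources_by_sequence.get(seq) == {path}
```

The main matrix program could then call `collate_sources` once up front and consult `is_unique_to` during its own (second) read-through. Note that a sequence repeated within one file would pass the check each time it is encountered, so the second pass would still need to avoid counting it twice.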

Give it a try and let's discuss issues later this week