TheJacksonLaboratory / diachromatic

Diachromatic is a Java application for preprocessing and quality control of Hi-C and CHi-C data.
https://diachromatic.readthedocs.io/en/latest/
GNU General Public License v3.0
3 stars 1 forks source link

Dedupmap #68

Closed pnrobinson closed 5 years ago

pnrobinson commented 5 years ago

I do not understand why we need these variables: insertion_num = 0; first_coord_num = 0; second_coord_num = 0;

Also, can we rename chr_pair_key_num to simply n_chromosomes?

hansenp commented 5 years ago

I think something went wrong with the branches. This branch (dedupmap) includes lots of commits that I pushed actually to the branch I'm currently working on which has the name iss64-count-command and I made a pull request for this branch yesterday:

Iss64 count command (https://github.com/TheJacksonLaboratory/diachromatic/pull/67)

The network graph looks now as follows:

image

Not sure what's the best way resolve this. Maybe we should simply merge the new count command and the new Junit5 tests into the develop. But this requires your review of the pull request Iss64 count command (https://github.com/TheJacksonLaboratory/diachromatic/pull/67).

For my part, I will answer your questions above and try out the new tests.

@pnrobinson: Maybe we can merge everything together during our next session on Skype.

hansenp commented 5 years ago

I do not understand why we need these variables: insertion_num = 0; first_coord_num = 0; second_coord_num = 0;

I added these variables for benchmarking and sanity checking purposes. The goal was to develop a memory efficient data structure that is able to keep track of all read pairs that have been already seen. Using the variables above, I showed that the DeDupMap requires only 1207 strings of the form chr1chr14 in order to keep track off 65,000 pairs. The naive approach would require to store 2 (one for each pair) times 65,000 strings of the form chr1.

@pnrobinson: I sent you an explanatory pdf file a few days ago. I also attached it to this comment (dedup_map.pdf). We can go through it together during our next Skype session.

Also, can we rename chr_pair_key_num to simply n_chromosomes?

Since we have stings that consist of pairs of chromosome names, and these strings are used as keys for a hash map, I think that chr_pair_key_num better reflects the content of the variable than n_chromosomes.

hansenp commented 5 years ago

I noticed that you also made some changes to the class DeDupMap, mainly replacements of Integer by int but also this:

image

@pnrobinson: I'm never sure how to write things like this. Please explain!

I tested on the small test dataset in src/test/resources/data/testDeDup/. Everything seems to work as before.

You added the test: src/test/java/org/jax/diachromatic/align/DeDupMapTest.java. This test is using mock objects and is well documented. The cases that are tested look reasonable. But there could be more tests with respect to orientation. For instance, two read pairs with identical positions but different orientations that cannot originate from a duplication event.

Finally, I noticed that you changed InteractionCountsMapTest. @BeforeClass was replaced by @BeforeAll. Why?