Consider PAR1/2 on Chr X as autosomal

rollf commented 1 year ago

This PR addresses issue #789. It adds the optional feature to treat the PAR1/2 on chromosome X as autosomal, i.e. when requesting autosomal regions these regions are included and on the other hand, when requesting "chromosome X", they are excluded. For chr Y, the PAR1/2 regions are excluded when requesting "chromosome Y" since they cannot be covered if all reads map to PAR1/2 on X. The results have already been shown in the issue mentioned above.

This PR contains the following changes:

Add new command line argument --diploid-parx-genome which can be set to grch37 or grch38 in order to treat PAR1/2 GRCh37/38 regions on chromosome X as autosomal.
This argument is added to all subcommands where necessary and forwarded.
The method GenomicArray.autosomes() has been re-implemented in CopyNumArray.autosomes() to be able to transparently include/exclude PAR1/2 on demand.
Conversely, wherever chromosome X regions have been requested (ex: (cnarr.chromosome == cnarr._chr_x_label)), this has been replaced by a call to CopyNumArray.chr_x_filter() which includes/excludes PAR1/2 as needed. Same for chromosome Y.
No differentiation is done based on the sex of the sample.
This approach has the advantage of having only one place where if/else semantics have to implemented. Whenever the code treats chromosome X, one doesn't have to reason about whether PAR1/2 is include or not.
This approach has disadvantages: If enabled, the regions for chrX will be split even for female samples (where it doesn't make sense, as opposed to for male samples [see #784 for an explanation]). Also, one has to rely on the user to write chrx = self[self.chrx_filter()] instead of self[self.chromosome == self.chr_x_label] (or even `self[self.chromosome == "chrX"]) - but it's in the nature of Python to give full access to object members.
Fix a bug in call.absolute_dataframe().
Add read_ga similar to read_cna() and use where appropriate (=GenomicArray tests)
I fixed the version of pomegranate to 0.14.8. See comments.

I will add some further review comments.

rollf commented 1 year ago

@etal Could you have a look into this when time permits?

(I believe the conflicts arose due to new formatting)

etal commented 1 year ago

Thanks! I'm reviewing this PR now. Sorry for forcing a rebase, but hopefully the codebase is a little easier to work with now.

rollf commented 1 year ago

@etal Thanks for reviewing. You haven't said whether you're globally fine with my approach but I assume it's okay then. I will change the code towards adding the treat_par... parameter into the meta dict. This should then already clean up things a little bit.

etal commented 1 year ago

You haven't said whether you're globally fine with my approach but I assume it's okay then. I will change the code towards adding the treat_par... parameter into the meta dict.

Yes, I'm globally fine with your approach, and I appreciate your efforts here. My only restriction in this PR is that read_cna(filename) and CNA(table) should work as expected on a non-human genome like yeast. For human-specific behavior I prefer named keyword arguments over positionals, but you can see I haven't been consistent with that either, so I won't fuss too much about it.

rollf commented 1 year ago

Hey @etal,

thanks again for your feedack. I have been working on other topics but I had also tried to make headway here. However, I was really unsatisfied with my implementation. Mainly, because I feel like I am not able ensure consistency (in terms of behavior) across the whole code base (i.e. I can't make sure that the necessary parameter(s) is/are always available when creating/dealing with CopyNumberArray objects). I'm currently pursuing a more global approach and will of course get back to you once I have something to show.

rollf commented 1 year ago

@etal I worked through the code base again and I'm now done. Can you please carefully review this again? I have updated the initial PR comment as well as added further comments since I have changed my mind and found a better way (IMHO) to implement this optional feature. The tests are currently failing due to dependency issues (see my comment in core.txt). I had also used blackd but refrained from doing so because I felt like I'm not using the same configuration (some parts of the code were formatted which I didn't touch).

Here are 2 x 3 plots. For the male & the female sample from a mixed pool, I have plotted chr X once based on the current master branch, once based on my branch without the feature enabled and once with the feature enabled. The male bias towards a gain and the female bias towards a loss are fixed (last plot for each sex).

male: male01 male01 male01

female: female01

rollf commented 1 year ago

Hi @etal,

thanks for the feedback. I renamed the file as you suggested. There are still two tests failing. For me, this looks like a dependency issue (after pinning pomegranate). Let me know how you would like to handle this.

rollf commented 1 year ago

Hi @etal,

I will be on a longer summer break for July & August. Do you think we could finish this before end of June? I'm happy to support you if needed but I don't see anything I could do for now.

etal commented 1 year ago

Sure. I'm not too fond of the tests py3.8-min and py3.10-macos. Ignoring those tests, do you feel this PR is ready to merge?

rollf commented 1 year ago

@etal Yes, this is ready for merge IMHO.

etal commented 1 year ago

All done, squashed and merged. Thanks for your contribution!

rollf commented 1 year ago

Yeah! :rocket:

etal / cnvkit

Consider PAR1/2 on Chr X as autosomal #789