calico / scBasset

Sequence-based Modeling of single-cell ATAC-seq using Convolutional Neural Networks.
Apache License 2.0

peak matrix binarized? #25

Closed simon-anders closed 2 weeks ago

simon-anders commented 2 weeks ago

I'm currently working through your tutorial scripts. I've downloaded your AnnData object of the Buenrostro2018 data, as found here, and trained the model with it. I noticed that the X matrix in the AnnData object contains read counts, but if I understand the paper correctly, you use binarized data.

Should I hence clip all data to {0,1}, i.e., replace any non-zero count with a one?

I see no mention of this in the tutorial code, but the binary cross entropy loss would not make sense without that, or would it?
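To make the concern concrete, here is a minimal sketch of the textbook binary cross entropy formula evaluated with a binary target and with a raw count as target. The helper name `bce` and the example values are illustrative assumptions, not code from scBasset. With a target of 1 the loss stays non-negative and approaches 0 as the prediction approaches 1; with a target of 3 the same formula becomes negative and keeps decreasing, so it no longer has a finite minimum.

```python
import math


def bce(y, p):
    """Textbook binary cross entropy for one target y and predicted probability p."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


probs = (0.5, 0.9, 0.999)

# Binary target: the loss is non-negative and shrinks toward 0 as p -> 1.
binary_losses = [bce(1, p) for p in probs]

# Raw count as target (hypothetical count of 3): the (1 - y) term flips sign,
# so the loss turns negative and decreases without bound as p -> 1.
count_losses = [bce(3, p) for p in probs]

print(binary_losses)
print(count_losses)
```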

Thanks in advance for your help.

hy395 commented 2 weeks ago

You don't need to do that explicitly. The data is binarized before being fed into the model: https://github.com/calico/scBasset/blob/d31138b1d28adaa427444a54f41494fe74e4be3a/scbasset/utils.py#L346
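For readers who want to inspect a binarized copy of the matrix themselves, the following is a minimal sketch of what explicit clipping to {0,1} could look like. It is not the code at the linked line; the function name `binarize` is a hypothetical helper, and it assumes the counts are either a SciPy sparse matrix or a dense NumPy array.

```python
import numpy as np
from scipy import sparse


def binarize(counts):
    """Return a copy of `counts` with every non-zero entry replaced by 1."""
    if sparse.issparse(counts):
        out = counts.tocsr(copy=True)       # copy so the input stays untouched
        out.eliminate_zeros()               # drop explicitly stored zeros
        out.data = np.ones_like(out.data)   # every stored entry becomes 1
        return out
    arr = np.asarray(counts)
    return (arr > 0).astype(arr.dtype)


# Toy peak-by-cell count matrix (made-up values for illustration).
counts = sparse.csr_matrix(np.array([[0, 3, 1], [2, 0, 0]]))
binary = binarize(counts)
print(binary.toarray())
```

On an AnnData object this could be applied as `binarize(adata.X)`, assuming `adata.X` holds the raw counts; as noted above, the training pipeline does not require it.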

simon-anders commented 2 weeks ago

Thanks a lot for the very fast reply. And thanks even more for pointing out the code position. I've been looking through the code for it but missed generator.__call__.