cistrome / MIRA

Python package for analysis of multiomic single cell RNA-seq and ATAC-seq.

Writing dataset to disk progress remaining at 0% #47

Open jalwillcox opened 1 month ago

jalwillcox commented 1 month ago

I am working with a large single-nucleus ATAC dataset (150k nuclei, 360k peaks), but have gotten stuck at the "Caching data to disk" section of the "Atlas-level integration" tutorial.

Each time I run the lines:

model.write_ondisk_dataset(train, dirname='./data/multimodal/mira/atac/atac_train')
model.write_ondisk_dataset(test, dirname='./data/multimodal/mira/atac/atac_test')

The "Writing dataset to disk" progress bar stays at 0% (for >30 min). When I stop it, it appears to have only written a header line to ./data/multimodal/mira/atac/atac_train/dataset_meta.pkl

I tried subsetting the data to include only 35k nuclei and 340k peaks, but ran into the same issue.

My versions: mira 2.1.1, python 3.10.12

Do you have any thoughts on why this might be happening?

Thank you! Jon

AllenWLynch commented 1 month ago

Hi Jon!

I have not seen this error before. Can you try sub-setting to even smaller dataset sizes to rule out a file IO problem please? If this doesn't work, the next thing may be to share a small segment of the data with me so I can reproduce.

Best, Allen
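Editor's note: the subsetting test suggested above could be sketched roughly as follows. This is a minimal illustration, not code from MIRA. `random_subset` is a hypothetical helper, and it assumes the object supports NumPy-style 2-D indexing and a `.shape` attribute, as AnnData objects, SciPy sparse matrices, and NumPy arrays do.

```python
import numpy as np


def random_subset(data, n_obs, n_vars, seed=0):
    """Return a random sub-block with at most n_obs rows and n_vars columns.

    Hypothetical helper for illustration; `data` is anything with a 2-D
    `.shape` that supports row and column indexing with integer arrays.
    """
    rng = np.random.default_rng(seed)
    total_obs, total_vars = data.shape
    # Sample without replacement and sort so the original ordering is kept.
    rows = np.sort(rng.choice(total_obs, size=min(n_obs, total_obs), replace=False))
    cols = np.sort(rng.choice(total_vars, size=min(n_vars, total_vars), replace=False))
    # Index rows first, then columns, so each axis is selected independently.
    return data[rows][:, cols]


# Sketch of use with the objects from the thread (assumed, not run here):
# small = random_subset(train, n_obs=1000, n_vars=20000).copy()
# model.write_ondisk_dataset(small, dirname='./some_test_dir')  # placeholder path
```

Repeating the write at a few increasing sizes would show whether the slowdown grows with the data or appears only past some threshold.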


jalwillcox commented 4 weeks ago

Hi Allen,

Thank you for the response!

I stripped the dataset down to a small subset of cells and peaks (~1000 cells and ~20000 peaks), and it worked.

Someone else in my lab has tried with the full dataset, though, and got it to work in a matter of minutes (the resources allocated to our environments are identical). That's got us wondering if it's a package versioning issue, but I'm not sure which packages might cause it. I've tried matching his versions for several packages (listed below), but haven't had any luck yet.

mira:2.1.1
scanpy:1.9.6
numpy:1.24.0
torch:2.0.0+cu117
pandas:2.2.2
anndata:0.10.3
tqdm:4.66.1
logging:0.5.1.2
pyro:1.8.6

Do you know what other packages might be involved that could lead to this behavior?

Thanks again!

Jon

jalwillcox commented 3 weeks ago

Ah, it looks like the problem was that an earlier line I was running to calculate highly variable peaks had made the sparse ATAC matrix dense, and I hadn't reset it to the sparse matrix... My mistake, but it seems to be working now!

Thank you for your help! Jon
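Editor's note: for readers who hit the same symptom, here is a minimal sketch of how the dense-matrix problem could be checked and undone. It is an illustration under assumptions, not the code used in the thread. `dense_size_gb` and `ensure_csr` are hypothetical helpers, and the example assumes the counts are stored in the AnnData `.X` attribute as float32.

```python
import numpy as np
from scipy import sparse


def dense_size_gb(n_rows, n_cols, dtype=np.float32):
    """Memory, in GB (1e9 bytes), that a fully dense matrix would occupy."""
    return n_rows * n_cols * np.dtype(dtype).itemsize / 1e9


def ensure_csr(matrix):
    """Return `matrix` as a SciPy CSR matrix, converting from dense if needed."""
    if sparse.issparse(matrix):
        return matrix.tocsr()
    return sparse.csr_matrix(np.asarray(matrix))


# Sketch of use before caching to disk (assumes counts live in `.X`):
# print(sparse.issparse(train.X))   # False means the matrix has become dense
# train.X = ensure_csr(train.X)
# test.X = ensure_csr(test.X)
```

For scale, `dense_size_gb(150_000, 360_000)` evaluates to 216.0, so a dense float32 copy of a matrix with the dimensions reported in this issue would need about 216 GB, while a sparse representation stores only the nonzero entries.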