aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.
export_pseudobulk runs correctly but does not write "done!" for one cluster #89

Closed by massonix 1 year ago

massonix commented 1 year ago

Hi!

I ran the export_pseudobulk function on a dataset with 4 clusters. The function finished successfully, returned the expected paths, and wrote the pseudobulk files to the expected locations. For 3 of the 4 clusters, the "done!" message is printed to standard output. Why don't I see it for the fourth cluster?

Context: I'm running export_pseudobulk on a SLURM HPC cluster with the following call:

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk

# cell_data, chromsizes, and the path variables are defined earlier in the script
bw_bed_path_dict = export_pseudobulk(
    input_data=cell_data,
    variable="annotation_20230727",
    sample_id_col="gem_id",
    chromsizes=chromsizes,
    bed_path=path_to_bed_files,
    bigwig_path=path_to_bw_files,
    path_to_fragments=path_fragments_dict,
    n_cpu=12,
    normalize_bigwig=True,
    remove_duplicates=True,
    _temp_dir=path_tmp,
)

This is the standard output:

2023-08-26 10:56:59,054 cisTopic     INFO     Reading fragments from /scratch/devel/rmassoni/richter_multiome/current/data/fragments_files/o2xlx1v6_sz8a2nvf_fragments_with_prefix_without_unconventional_chromosomes.tsv.gz
2023-08-26 10:58:56,293 cisTopic     INFO     Reading fragments from /scratch/devel/rmassoni/richter_multiome/current/data/fragments_files/tcv8g80g_txps9bam_fragments_with_prefix_without_unconventional_chromosomes.tsv.gz
(export_pseudobulk_ray pid=98112) 2023-08-26 11:01:08,233 cisTopic     INFO     Creating pseudobulk for Cluster1
(export_pseudobulk_ray pid=98109) 2023-08-26 11:01:09,297 cisTopic     INFO     Creating pseudobulk for Cluster2
(export_pseudobulk_ray pid=98110) 2023-08-26 11:01:10,633 cisTopic     INFO     Creating pseudobulk for Cluster3
(export_pseudobulk_ray pid=98111) 2023-08-26 11:01:12,060 cisTopic     INFO     Creating pseudobulk for Cluster4
(export_pseudobulk_ray pid=98111) 2023-08-26 11:03:31,720 cisTopic     INFO     Cluster4 done!
(export_pseudobulk_ray pid=98110) 2023-08-26 11:06:33,730 cisTopic     INFO     Cluster3 done!
(export_pseudobulk_ray pid=98109) 2023-08-26 11:20:22,411 cisTopic     INFO     Cluster2 done!

And the standard error:

2023-08-26 11:01:03,724 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(export_pseudobulk_ray pid=98110) /home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:274: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
(export_pseudobulk_ray pid=98110)   group_fragments = group_fragments_list[0].append(group_fragments_list[1:])
(export_pseudobulk_ray pid=98111) /home/groups/singlecell/rmassoni/anaconda3/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:274: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(export_pseudobulk_ray pid=98111)   group_fragments = group_fragments_list[0].append(group_fragments_list[1:]) [repeated 3x across cluster]

As you can see, the logger never prints the "Cluster1 done!" line, even though the function wrote 1.3 GB of output for Cluster1 and the code finished without errors.
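For longer logs with many clusters, the same comparison can be scripted rather than read by eye. The following is a minimal sketch that assumes the two message shapes shown in the standard output above ("Creating pseudobulk for X" and "X done!") and group names without spaces; `groups_without_done` is a hypothetical helper, not part of pycisTopic.

```python
import re

# Patterns modelled on the two log messages seen in the standard output above.
# Assumption: group names contain no whitespace.
START_RE = re.compile(r"Creating pseudobulk for (?P<group>\S+)")
DONE_RE = re.compile(r"INFO\s+(?P<group>\S+) done!")


def groups_without_done(log_text):
    """Return groups that logged a start message but no matching 'done!' line."""
    started = []
    finished = set()
    for line in log_text.splitlines():
        start = START_RE.search(line)
        if start:
            started.append(start.group("group"))
            continue
        done = DONE_RE.search(line)
        if done:
            finished.add(done.group("group"))
    # Keep the order in which the groups were started.
    return [group for group in started if group not in finished]
```

Applied to the standard output pasted above, this returns `["Cluster1"]`. It only reports which messages are absent from the log; it says nothing about whether the files themselves are complete.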

I'm using Python 3.8.17 and SCENIC+ 1.0.1.dev3+g3741a4b.

How can I check whether all the expected output for Cluster1 was indeed generated? Can I confidently continue analyzing the data?

Thanks!

SeppeDeWinter commented 1 year ago

Hi @massonix

If all the expected output for that cluster was indeed generated, you can continue with the analysis without a problem. I'm not sure why the "done!" message is not printed for that particular cluster.

The data should only be saved if the function finished properly, so I would not worry.
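One way to confirm that the output is there is to check each returned path directly. This is a minimal sketch using only the Python standard library. It assumes you have a dictionary mapping each cluster name to an output file path (inspect what export_pseudobulk returned in your version to build it). `check_output_file` and `check_outputs` are hypothetical helpers, not pycisTopic functions. For gzip-compressed files, it reads the file to the end, which fails if the file was truncated. Other files (for example bigWig) are only checked for existence and non-zero size.

```python
import gzip
import os
import zlib


def check_output_file(path, chunk_size=1 << 20):
    """Return a short status string for one output file."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    if str(path).endswith(".gz"):
        try:
            # Stream through the whole file; a truncated or damaged gzip
            # stream raises before the end is reached.
            with gzip.open(path, "rb") as handle:
                while handle.read(chunk_size):
                    pass
        except (EOFError, OSError, zlib.error):
            return "corrupt gzip"
    return "ok"


def check_outputs(paths_by_group):
    """Map each group name to the status of its output file."""
    return {
        group: check_output_file(path)
        for group, path in paths_by_group.items()
    }
```

Any cluster whose status is not "ok" would be worth regenerating before moving on.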

If you are still worried about it, you can try rerunning it on a single core (n_cpu = 1). This avoids multiprocessing (which uses Ray), and you might see the "done!" message in that case.
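A rerun like this could be wrapped as follows. This is only a sketch: `rerun_with_n_cpu` is a hypothetical helper, and it assumes the function being called accepts an `n_cpu` keyword, as in the call shown in the question. It also sets `RAY_DEDUP_LOGS=0`, the setting named in the Ray notice in the standard error above, which only matters for reruns that still use Ray and should be set before Ray starts.

```python
import os


def rerun_with_n_cpu(export_fn, call_kwargs, n_cpu=1):
    """Call export_fn with the same keyword arguments but a different n_cpu."""
    # Setting taken from the Ray notice in the standard error above;
    # it has no effect on a run that does not use Ray.
    os.environ["RAY_DEDUP_LOGS"] = "0"
    kwargs = dict(call_kwargs)  # copy so the caller's dict is left untouched
    kwargs["n_cpu"] = n_cpu
    return export_fn(**kwargs)
```

Here `export_fn` would be export_pseudobulk and `call_kwargs` a dictionary holding the same arguments as the original call.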

All the best.

Seppe

massonix commented 1 year ago

That's perfect, thank you Seppe!

SeppeDeWinter commented 1 year ago

You're welcome