aertslab / create_cisTarget_databases

Create cisTarget databases
37 stars 8 forks source link

What is the correct workflow for creating tracks DBs? #35

Open eboileau opened 1 year ago

eboileau commented 1 year ago

First off, thanks for putting this together for the community.

I am trying to create a cisTarget track DB for mouse to use with pySCENIC, using a number of ChIP-seq (bigWig) files. I am aware of the multiple (still open) issues on related topics, and a tutorial e.g. PBMC10k_SCENIC-protocol-CLI-tracks, etc., but I must say, despite all that and the instructions given here, I am still a bit unsure about what is a correct workflow.

  1. First step, score-all-tracks-at-once-and-create-rankings. Here, I am using --tracks to pass a list of bigWig files, and for --bed I am using the files that you provide under regions, e.g. mm10-limited-upstream500-tss-downstream100-full-transcript.bed. This steps outputs the files described in the instructions, except *.tracks_vs_regions.rankings.feather. I have 1 question for this step in particular:

  2. Second step, I think from here instructions are a bit unclear for track-based annotations... According to my understanding, I need to run convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py using the output *.tracks_vs_regions.scores.feather from step 1. But it looks like it's not even reading in the file (same whether we use regions or genes with --genes):

Traceback (most recent call last):
  File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 175, in <module>
    main()
  File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 110, in main
    ct_scores_db_motifs_or_tracks_vs_regions_or_genes = CisTargetDatabase.read_db(
  File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/cistarget_db.py", line 980, in read_db
    all_column_names = pa_table.column_names
UnboundLocalError: local variable 'pa_table' referenced before assignment
  1. If Step 1 (and Step 2) above complete successfully, this would give me a "ranking database" to run pySCENIC. Here I have a few quetions:

    • For the "ranking database", actually which feather file should I use? (regions_vs_tracks.rankings.feather`? tracks_vs_tracks.rankings.feather? I don't know what would be the output of Step 2?). And actually do you recommend using "regions" or "genes" (e.g regions_vs_tracks.rankings vs. genes_vs_tracks.rankings) ?

    • Most importantly, how do I generate a matching "motif database" to use with --annotations_fname ( files like those under track2tf )?

  2. Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?

Thanks for taking time to clarify, I'm sure this will help others as well.


I am running under: SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux

and that's my environment:

_libgcc_mutex             0.1                 conda_forge    conda-forge                                                                                                    
_openmp_mutex             4.5                       2_gnu    conda-forge                                                                                                    
arrow-cpp                 10.0.1           ha770c72_6_cpu    conda-forge                                                                                                    
aws-c-auth                0.6.22               hd93a3ba_0    conda-forge                                                                                                    
aws-c-cal                 0.5.20               hff2c3d7_3    conda-forge                                                                                                    
aws-c-common              0.8.5                h166bdaf_0    conda-forge                                                                                                    
aws-c-compression         0.2.16               hf5f93bc_0    conda-forge                                                                                                    
aws-c-event-stream        0.2.18               h57874a7_0    conda-forge                                                                                                    
aws-c-http                0.7.0                h96ef541_0    conda-forge
aws-c-io                  0.13.12              h57ca295_1    conda-forge    
aws-c-mqtt                0.7.13              h0b5698f_12    conda-forge
aws-c-s3                  0.2.3                hc08f4d7_1    conda-forge
aws-c-sdkutils            0.1.7                hf5f93bc_0    conda-forge
aws-checksums             0.1.14               h6027aba_0    conda-forge
aws-crt-cpp               0.18.16             h20a6995_11    conda-forge
aws-sdk-cpp               1.10.57              ha834a50_1    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
gflags                    2.2.2             he1b5a44_1004    conda-forge
glog                      0.6.0                h6f12383_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.20.1               h81ceb04_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libabseil                 20220623.0      cxx17_h05df665_6    conda-forge
libarrow                  10.0.1           hf9c26a6_6_cpu    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
libbrotlidec              1.0.9                h166bdaf_8    conda-forge
libbrotlienc              1.0.9                h166bdaf_8    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libcrc32c                 1.1.2                h9c3ff4c_0    conda-forge
libcurl                   7.87.0               hdc1c0ab_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               h28343ad_4    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge                                                                                                    
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
libgoogle-cloud           2.5.0                h21dfe5b_1    conda-forge
libgrpc                   1.51.1               h30feacc_0    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
libllvm11                 11.1.0               he0ac6c6_5    conda-forge
libnghttp2                1.51.0               hff17c54_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libprotobuf               3.21.12              h3eb15da_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               hf14f497_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libthrift                 0.16.0               he500d00_2    conda-forge
libutf8proc               2.8.0                h166bdaf_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
llvmlite                  0.39.1          py310h58363a5_1    conda-forge
lz4-c                     1.9.4                hcb278e6_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
numba                     0.56.4          py310ha5257ce_0    conda-forge
numpy                     1.21.6          py310h45f3432_0    conda-forge
openssl                   3.0.7                h0b41bf4_2    conda-forge
orc                       1.8.2                hfdbbad2_0    conda-forge
pandas                    1.5.3           py310h9b08913_0    conda-forge
parquet-cpp               1.5.1                         2    conda-forge
pip                       23.0               pyhd8ed1ab_0    conda-forge
pyarrow                   10.0.1          py310h633f555_6_cpu    conda-forge
python                    3.10.8          h4a9ceb5_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-flatbuffers        23.1.21            pyhd8ed1ab_0    conda-forge
python_abi                3.10                    3_cp310    conda-forge
pytz                      2022.7.1           pyhd8ed1ab_0    conda-forge
re2                       2022.06.01           h27087fc_1    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
s2n                       1.3.31               h3358134_0    conda-forge
setuptools                66.1.1             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
snappy                    1.1.9                hbd366e4_2    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
tzdata                    2022g                h191b570_0    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge
zstd                      1.5.2                h3eb15da_6    conda-forge
ghuls commented 1 year ago

writing to disk of *.tracks_vs_regions.rankings.feather is commented out, so I guess we don't need this file, right?

No, this file does not need to be saved. It is only temporarily needed in memory. Writing this file takes quite a bit of time (if you have a lot of regions).

The second step would only be needed if iyou created the the first step in multiple steps (when using --partial CURRENT_PART NBR_TOTAL_PARTS). If you have *.regions_vs_tracks.rankings.feather, you are fine.

ghuls commented 1 year ago

@eboileau What was the full command that you ran? I suspect that your input feather file you used was not in the proper format.

Traceback (most recent call last):
  File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 175, in <module>
    main()
  File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 110, in main
    ct_scores_db_motifs_or_tracks_vs_regions_or_genes = CisTargetDatabase.read_db(
  File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/cistarget_db.py", line 980, in read_db
    all_column_names = pa_table.column_names
UnboundLocalError: local variable 'pa_table' referenced before assignment
eboileau commented 1 year ago

Thanks for your reply.

I ran the first step create_cistarget_track_databases.py, as explained above, using a small dataset example of 10 tracks. It's been some time, but I think this is the right log output:

Initialize dataframe (92634 regions x 10 tracks) for storing track scores for each regions per track.
Adding bigWigAverageOverBed track scores (1 of 10) for track "ERX3374317" took 0.103590 seconds.
Adding bigWigAverageOverBed track scores (2 of 10) for track "ERX3374325" took 0.066980 seconds.
Adding bigWigAverageOverBed track scores (3 of 10) for track "ERX3374319" took 0.054182 seconds.
Adding bigWigAverageOverBed track scores (4 of 10) for track "ERX3374323" took 0.061743 seconds.
Adding bigWigAverageOverBed track scores (5 of 10) for track "ERX3374321" took 0.045412 seconds.
Adding bigWigAverageOverBed track scores (6 of 10) for track "ERX3374320" took 0.027367 seconds.
Adding bigWigAverageOverBed track scores (7 of 10) for track "ERX3374324" took 0.056523 seconds.
Adding bigWigAverageOverBed track scores (8 of 10) for track "ERX3374318" took 0.048575 seconds.
Adding bigWigAverageOverBed track scores (9 of 10) for track "ERX3374322" took 0.035871 seconds.
Adding bigWigAverageOverBed track scores (10 of 10) for track "ERX3374326" took 0.035685 seconds.

Create rankings from "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather" with random seed set to 1984.
Writing cisTarget regions vs tracks scores db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather"
Writing cisTarget regions vs tracks scores db took: 0.106667 seconds

Writing cisTarget tracks vs regions scores db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.regions_vs_tracks.scores.feather"
Writing cisTarget tracks vs regions scores db took: 19.852678 seconds

Creating cisTarget rankings db from cisTarget scores db took: 0.174052 seconds

Writing cisTarget tracks vs regions rankings db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.regions_vs_tracks.rankings.feather"
Writing cisTarget tracks vs regions rankings db took: 7.912627 seconds

Using the output of step 1, I then ran step 2

./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py \
     -i /prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather \
     --seed 1984

which resulted in the above error.

The files are here: https://data.dieterichlab.org/s/8Npf6r5bGdfaASM

ghuls commented 1 year ago

Thanks for providing the test files.

In case you would need to run 1convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py in the future, the bug is fixed (missing 1 character): https://github.com/aertslab/create_cisTarget_databases/commit/9572f4147bcd9853571a687a2b3833241bff641d#diff-09ae4856d6de6fde66c2e4398454f6d41a333c12d018f20b0f0969b2f5b95cffR508

eboileau commented 1 year ago

Hi @ghuls thanks for taking time to fix this. I'd appreciate your comments on some more general questions:

  1. If Step 1 (and Step 2) above complete successfully, this would give me a "ranking database" to run pySCENIC. Here I have a few questions:

    • For the "ranking database", actually which feather file should I use? (regions_vs_tracks.rankings.feather`? tracks_vs_tracks.rankings.feather? I don't know what would be the output of Step 2?). And actually do you recommend using "regions" or "genes" (e.g regions_vs_tracks.rankings vs. genes_vs_tracks.rankings) ?

Ok, so now I have my *regions_vs_tracks.rankings.feather to run pySCENIC. What about the difference between "regions" or "genes", i.e. more in terms of overall performance, do you have any experience?

  • Most importantly, how do I generate a matching "motif database" to use with --annotations_fname ( files like those under track2tf )?

Do I have to generate this from scratch? Are you aware of any resource for mouse?

  1. Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?

?

ghuls commented 1 year ago

Ok, so now I have my *regions_vs_tracks.rankings.feather to run pySCENIC. What about the difference between "regions" or "genes", i.e. more in terms of overall performance, do you have any experience? Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?

pySCENIC will only work with gene based rankings databases as it will also use the expression data. Region based databases are used in SCENIC+ as you would have access to ATAC data.

But in general region databases are "better" than gene databases as they have a higher resolution (way more number of regions than number of genes). The disadvantage of region based databases is that you would need to associate a region with a downstream affected gene. In the gene based databases we assume that a e.g 10kb area around the TSS is correlated with the expression of the gene, but this dismisses distal enhancers.

Do I have to generate this from scratch? Are you aware of any resource for mouse? Yes you will have to generate them from scratch. Here are some old examples from files in that format:

https://resources.aertslab.org/cistarget/track2tf/

You will need a file with the following header: #motif_id\tgene_name\tmotif_similarity_qvalue\torthologous_identity\tdescription\n.

The format was originally only meant for motifs and not tracks, so the header is a bit weird:

motif_id | gene_name | motif_similarity_qvalue | orthologous_identity | description

-- | -- | -- | -- | -- accession | TF | 0 | 1 | description ENCFF522XXJ | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF355QJK | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF963FLO | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF013PDC | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF508BVS | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown ) ENCFF560OIN | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown ) ENCFF819BQU | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown )

BTW, our SCENIC+ public motif collection is now available: https://resources.aertslab.org/cistarget/motif_collections/

moitf2TF snapshots for that collection can be found at: https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/snapshots/

To create annotation for other species (if you would make one for non-human,mouse,fly) try to find orthologs for your species for the gene name mentioned in the gene_name (column 6) column and replace it with the gene name of your species. If you don't have an ortholog, for that gene, delete the line.

eboileau commented 1 year ago

Thanks @ghuls this is really useful. Let me take a few days to digest all this and go back to my project, then I'll mark this issue as resolved.