Open eboileau opened 1 year ago
writing to disk of
*.tracks_vs_regions.rankings.feather
is commented out, so I guess we don't need this file, right?
No, this file does not need to be saved. It is only temporarily needed in memory. Writing this file takes quite a bit of time (if you have a lot of regions).
The second step would only be needed if iyou created the the first step in multiple steps (when using --partial CURRENT_PART NBR_TOTAL_PARTS
). If you have *.regions_vs_tracks.rankings.feather
, you are fine.
@eboileau What was the full command that you ran? I suspect that your input feather file you used was not in the proper format.
Traceback (most recent call last):
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 175, in <module>
main()
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py", line 110, in main
ct_scores_db_motifs_or_tracks_vs_regions_or_genes = CisTargetDatabase.read_db(
File "/beegfs/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/create_cisTarget_databases/cistarget_db.py", line 980, in read_db
all_column_names = pa_table.column_names
UnboundLocalError: local variable 'pa_table' referenced before assignment
Thanks for your reply.
I ran the first step create_cistarget_track_databases.py
, as explained above, using a small dataset example of 10 tracks. It's been some time, but I think this is the right log output:
Initialize dataframe (92634 regions x 10 tracks) for storing track scores for each regions per track.
Adding bigWigAverageOverBed track scores (1 of 10) for track "ERX3374317" took 0.103590 seconds.
Adding bigWigAverageOverBed track scores (2 of 10) for track "ERX3374325" took 0.066980 seconds.
Adding bigWigAverageOverBed track scores (3 of 10) for track "ERX3374319" took 0.054182 seconds.
Adding bigWigAverageOverBed track scores (4 of 10) for track "ERX3374323" took 0.061743 seconds.
Adding bigWigAverageOverBed track scores (5 of 10) for track "ERX3374321" took 0.045412 seconds.
Adding bigWigAverageOverBed track scores (6 of 10) for track "ERX3374320" took 0.027367 seconds.
Adding bigWigAverageOverBed track scores (7 of 10) for track "ERX3374324" took 0.056523 seconds.
Adding bigWigAverageOverBed track scores (8 of 10) for track "ERX3374318" took 0.048575 seconds.
Adding bigWigAverageOverBed track scores (9 of 10) for track "ERX3374322" took 0.035871 seconds.
Adding bigWigAverageOverBed track scores (10 of 10) for track "ERX3374326" took 0.035685 seconds.
Create rankings from "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather" with random seed set to 1984.
Writing cisTarget regions vs tracks scores db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather"
Writing cisTarget regions vs tracks scores db took: 0.106667 seconds
Writing cisTarget tracks vs regions scores db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.regions_vs_tracks.scores.feather"
Writing cisTarget tracks vs regions scores db took: 19.852678 seconds
Creating cisTarget rankings db from cisTarget scores db took: 0.174052 seconds
Writing cisTarget tracks vs regions rankings db: "/prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.regions_vs_tracks.rankings.feather"
Writing cisTarget tracks vs regions rankings db took: 7.912627 seconds
Using the output of step 1, I then ran step 2
./convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py \
-i /prj/LZ_PR2B_rewiring/PR2B_Oct21/data_cm_2023/resources/tracks/test.mm10.tracks_vs_regions.scores.feather \
--seed 1984
which resulted in the above error.
The files are here: https://data.dieterichlab.org/s/8Npf6r5bGdfaASM
Thanks for providing the test files.
In case you would need to run 1convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py
in the future, the bug is fixed (missing 1 character): https://github.com/aertslab/create_cisTarget_databases/commit/9572f4147bcd9853571a687a2b3833241bff641d#diff-09ae4856d6de6fde66c2e4398454f6d41a333c12d018f20b0f0969b2f5b95cffR508
Hi @ghuls thanks for taking time to fix this. I'd appreciate your comments on some more general questions:
If Step 1 (and Step 2) above complete successfully, this would give me a "ranking database" to run pySCENIC. Here I have a few questions:
- For the "ranking database", actually which feather file should I use? (regions_vs_tracks.rankings.feather`? tracks_vs_tracks.rankings.feather? I don't know what would be the output of Step 2?). And actually do you recommend using "regions" or "genes" (e.g regions_vs_tracks.rankings vs. genes_vs_tracks.rankings) ?
Ok, so now I have my *regions_vs_tracks.rankings.feather to run pySCENIC. What about the difference between "regions" or "genes", i.e. more in terms of overall performance, do you have any experience?
- Most importantly, how do I generate a matching "motif database" to use with
--annotations_fname
( files like those under track2tf )?
Do I have to generate this from scratch? Are you aware of any resource for mouse?
- Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?
?
Ok, so now I have my *regions_vs_tracks.rankings.feather to run pySCENIC. What about the difference between "regions" or "genes", i.e. more in terms of overall performance, do you have any experience? Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?
pySCENIC will only work with gene based rankings databases as it will also use the expression data. Region based databases are used in SCENIC+ as you would have access to ATAC data.
But in general region databases are "better" than gene databases as they have a higher resolution (way more number of regions than number of genes). The disadvantage of region based databases is that you would need to associate a region with a downstream affected gene. In the gene based databases we assume that a e.g 10kb area around the TSS is correlated with the expression of the gene, but this dismisses distal enhancers.
Do I have to generate this from scratch? Are you aware of any resource for mouse? Yes you will have to generate them from scratch. Here are some old examples from files in that format:
https://resources.aertslab.org/cistarget/track2tf/
You will need a file with the following header: #motif_id\tgene_name\tmotif_similarity_qvalue\torthologous_identity\tdescription\n
.
The format was originally only meant for motifs and not tracks, so the header is a bit weird:
motif_id
you use your track name (ChIP-seq names in your database).gene_name
you use the TF name associated with this dataset.motif_similarity_qvalue
: 0
(no motif similarity used as we have tracks).orthologous_identity
: 1
(no orthologous annotation from another species as we assume you don't use ChIP-seq data from another species for scoring)description
: whatever metadata you like (not used)-- | -- | -- | -- | -- accession | TF | 0 | 1 | description ENCFF522XXJ | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF355QJK | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF963FLO | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF013PDC | pnt | 0 | 1 | eGFP-pnt (embryonic: 0-24 hour) ENCFF508BVS | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown ) ENCFF560OIN | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown ) ENCFF819BQU | kmg | 0 | 1 | eGFP-CG5204 (adult: unknown )
BTW, our SCENIC+ public motif collection is now available: https://resources.aertslab.org/cistarget/motif_collections/
moitf2TF snapshots for that collection can be found at: https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/snapshots/
To create annotation for other species (if you would make one for non-human,mouse,fly) try to find orthologs for your species for the gene name mentioned in the gene_name (column 6) column and replace it with the gene name of your species. If you don't have an ortholog, for that gene, delete the line.
Thanks @ghuls this is really useful. Let me take a few days to digest all this and go back to my project, then I'll mark this issue as resolved.
First off, thanks for putting this together for the community.
I am trying to create a cisTarget track DB for mouse to use with pySCENIC, using a number of ChIP-seq (bigWig) files. I am aware of the multiple (still open) issues on related topics, and a tutorial e.g. PBMC10k_SCENIC-protocol-CLI-tracks, etc., but I must say, despite all that and the instructions given here, I am still a bit unsure about what is a correct workflow.
First step, score-all-tracks-at-once-and-create-rankings. Here, I am using
--tracks
to pass a list of bigWig files, and for--bed
I am using the files that you provide under regions, e.g. mm10-limited-upstream500-tss-downstream100-full-transcript.bed. This steps outputs the files described in the instructions, except *.tracks_vs_regions.rankings.feather. I have 1 question for this step in particular:Second step, I think from here instructions are a bit unclear for track-based annotations... According to my understanding, I need to run convert_motifs_or_tracks_vs_regions_or_genes_scores_to_rankings_cistarget_dbs.py using the output *.tracks_vs_regions.scores.feather from step 1. But it looks like it's not even reading in the file (same whether we use regions or genes with
--genes
):If Step 1 (and Step 2) above complete successfully, this would give me a "ranking database" to run pySCENIC. Here I have a few quetions:
For the "ranking database", actually which feather file should I use? (regions_vs_tracks.rankings.feather`? tracks_vs_tracks.rankings.feather? I don't know what would be the output of Step 2?). And actually do you recommend using "regions" or "genes" (e.g regions_vs_tracks.rankings vs. genes_vs_tracks.rankings) ?
Most importantly, how do I generate a matching "motif database" to use with
--annotations_fname
( files like those under track2tf )?Finally, a more general questions: How does this compare with region-based databases that you provide? Are these only usable with pycisTarget/SCENIC+?
Thanks for taking time to clarify, I'm sure this will help others as well.
I am running under:
SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
and that's my environment: