aertslab / create_cisTarget_databases

Create cisTarget databases

feather file doesn't contain 'features' when input into pyscenic #6

Closed yuyun-zhang closed 3 years ago

yuyun-zhang commented 3 years ago

Hi @tropfenameimer,

Thanks for the great tool!

I want to use pySCENIC in plants, so I created my own plant cisTarget databases with the tool "create_cisTarget_databases". Everything ran without errors and I got 4 feather files.

The command was:
create_cistarget_motif_databases.py -f genebodyupd3k.fa -M cbust_motif/cb_db/ -m cbust.list -o genebodyupd3k -t 25

Issue-1: None of these 4 feather files can be read by the R package "RcisTarget" or the R package "feather". The error is shown below:

> library(RcisTarget)
> a <- importRankings("genebodyupd3k.regions_vs_motifs.rankings.feather")

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: openFeather(path)
 2: feather(path)
 3: feather::read_feather(dbFile, columns = columns)
 4: importRankings("gengbodyupd3k.regions_vs_motifs.rankings.feather")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

However, a feather file downloaded from the database can be read. The filename is "hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather", downloaded from https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather

Issue-2:

I input the feather file into "pyscenic ctx" with the command below:

pyscenic ctx --annotations_fname motifs.anno.tbl --expression_mtx_fname expmax_lognorm.gene-symbol.csv --output regulons.csv --num_workers 25 adjacencies_corr.tsv genebodyupd3k.regions_vs_motifs.rankings.feather

No matter which of the 4 feather files is given to pyscenic, this error is reported:

KeyError: "None of ['features'] are in the columns"

With the downloaded feather file (hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather) as input, there is NO error.

It looks like there may be something wrong with the feather files. Can you help solve these two problems? Thank you!

tropfenameimer commented 3 years ago

hi @yuyun-zhang,

which version of RcisTarget are you using? please install the latest version from github and try again to load your feather file:

db <- importRankings("genebodyupd3k.regions_vs_motifs.rankings.feather", indexCol = "motifs")

(setting indexCol to "motifs" is important, because by default RcisTarget sets the first column as features or motifs column, but when you created your db with create_cistarget_motif_databases.py, the first column will contain a ranking of one of your regions)

you can also try to load it directly with arrow, to see if the feather file is ok:

db <- arrow::read_feather("genebodyupd3k.regions_vs_motifs.rankings.feather")

you are having your second issue because pyscenic expects a column named 'features', but create_cistarget_motif_databases.py makes a database with the motifs column named 'motifs'. a quick solution is to open your db (e.g. with arrow in R), change the name of the column to 'features', and save it again. we might add the features column name as an option to pyscenic soon.
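To make the column-name mismatch concrete, here is a minimal Python sketch of the kind of lookup that fails. It is not pySCENIC's actual code: `pick_index_column` and the example column names are hypothetical, and only the names 'features' and 'motifs' come from this thread.

```python
def pick_index_column(columns, candidates=("features", "motifs")):
    """Return the first candidate name that is present in `columns`.

    Hypothetical helper for illustration only; it is not part of
    pySCENIC or create_cisTarget_databases.
    """
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f"None of {list(candidates)} are in the columns")


# Made-up column list shaped like a regions_vs_motifs database:
# one column per region, plus the column that holds the motif IDs.
columns = ["region_1", "region_2", "motifs"]

# A reader that only accepts 'features' fails on this database ...
try:
    pick_index_column(columns, candidates=("features",))
except KeyError as error:
    print("strict lookup failed:", error)

# ... while one that also accepts 'motifs' finds the ID column.
print("index column:", pick_index_column(columns))
```

A reader that accepts both names would make the manual rename unnecessary. The rename remains a workaround for readers that only know 'features'.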

yuyun-zhang commented 3 years ago

Hi @tropfenameimer,

Thanks for your reply!

I used RcisTarget 1.6.0 before, and have now installed version 1.10.0. But the following commands do not work:

db <- importRankings("genebodyupd3k.motifs_vs_regions.rankings.feather", indexCol = "regions")
db <- importRankings("genebodyupd3k.regions_vs_motifs.rankings.feather", indexCol = "motifs")

They fail with the error:

R Session Aborted
R encountered a fatal error. The session was terminated.

and then R restarts.

When I use arrow to load the motifs_vs_regions feather file, everything is ok:

db <- arrow::read_feather("genebodyupd3k.motifs_vs_regions.rankings.feather")

I tried loading the file genebodyupd3k.regions_vs_motifs.rankings.feather all night, but it didn't work, probably because there are too many columns (107,891 genes x 380 motifs).

Is it possible to input the file motifs_vs_regions.rankings.feather into pyscenic instead of regions_vs_motifs.rankings.feather? In the meantime I will wait for the file regions_vs_motifs.rankings.feather to load and then change the name of the column to 'features'.

I just tried to input the renamed feather file motifs_vs_regions.rankings.rename.feather into pyscenic, and an error was reported:

pyarrow.lib.ArrowInvalid: Not a feather file

The code I used to change the name of the column is:

library(arrow)
db <- read_feather("genebodyupd3k.motifs_vs_regions.rankings.feather")
db2 <- db[, c(381, 1:380)]
colnames(db2)[1] <- "features"
write_feather(db2, "genebodyupd3k.motifs_vs_regions.rankings.rename.feather")

and then I ran:

pyscenic ctx --annotations_fname motifs.anno.tbl --expression_mtx_fname expmax_lognorm.gene-symbol.csv --output regulons.csv --num_workers 25 adjacencies_corr.tsv genebodyupd3k.motifs_vs_regions.rankings.rename.feather

So I guess the same error may be reported when I input the renamed file genebodyupd3k.regions_vs_motifs.rankings.rename.feather.

Is there a problem with my steps?

tropfenameimer commented 3 years ago

hi @yuyun-zhang, please install the latest version of RcisTarget (1.11.10) from github:

devtools::install_github("aertslab/RcisTarget")

you really need the file *regions_vs_motifs.rankings.feather, scenic won't work with the other files. your renaming procedure looks correct to me, but you'd need to do this with the correct file.

it is strange that you can load the motifs_vs_regions file with arrow, but not the regions_vs_motifs file. they should both be of similar size, and from the number of genes & motifs you mentioned, not larger than 1Gb. maybe your file is corrupt? can you re-generate it, maybe with a subset of genes & motifs to test?
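As a rough check on the size estimate above, the arithmetic can be sketched in Python. The bytes-per-value figures are assumptions (the thread does not say which integer type the rankings use), and the estimate ignores the ID column, metadata and any compression.

```python
def estimate_size_bytes(n_regions, n_motifs, bytes_per_value):
    """Approximate size of a dense rankings matrix, ignoring overhead."""
    return n_regions * n_motifs * bytes_per_value


# Dimensions mentioned earlier in this thread.
n_regions = 107_891
n_motifs = 380

# Assumed storage widths: 2 bytes (int16) and 4 bytes (int32).
for label, width in (("int16", 2), ("int32", 4)):
    size = estimate_size_bytes(n_regions, n_motifs, width)
    print(f"{label}: {size:,} bytes ({size / 1024**3:.3f} GiB)")
```

Under either assumption the matrix stays well below 1 GiB, and the total is the same whichever dimension is stored as columns. That is consistent with the expectation that both files should be of similar size.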

ghuls commented 3 years ago

@yuyun-zhang Can you run the following on all your Feather files?

ls -l ${feather_file}

hexdump -C -n 8 ${feather_file}

There are 2 versions of the Feather format: the original one (v1), and the new one (v2), which is the Arrow IPC format.

# Feather v1 file.
❯ hexdump -C -n 8 test/ct_rankings_db_genes_vs_tracks.feather_version1.genes_vs_tracks.rankings.feather
00000000  46 45 41 31 00 00 00 00                           |FEA1....|
00000008

# Feather v2 file.
❯ hexdump -C -n 8 test/ct_rankings_db_genes_vs_tracks.feather_version2.genes_vs_tracks.rankings.feather
00000000  41 52 52 4f 57 31 00 00                           |ARROW1..|
00000008

For pySCENIC you need a Feather file in v1 format. Feather v2 is only supported from pyarrow>1.0.0, but pySCENIC is still stuck on lower pyarrow versions, because newer pyarrow releases removed the option to read the column names without loading the whole Feather file into memory.
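The same magic-byte check can be scripted. The sketch below is a minimal Python version that relies only on the leading bytes shown in the hexdump output above ("FEA1" for v1, "ARROW1" for v2); the function name `feather_version` is made up for this example and is not part of pyarrow or pySCENIC.

```python
def feather_version(path):
    """Guess the Feather format version from the file's leading magic bytes.

    Returns 1 for "FEA1", 2 for "ARROW1", and None for anything else.
    Hypothetical helper; it only inspects the first bytes and does not
    validate the rest of the file.
    """
    with open(path, "rb") as handle:
        magic = handle.read(6)
    if magic.startswith(b"FEA1"):
        return 1
    if magic.startswith(b"ARROW1"):
        return 2
    return None
```

For example, calling `feather_version("genebodyupd3k.motifs_vs_regions.rankings.feather")` on a file written by create_cistarget_motif_databases.py would be expected to return 1 if it starts with the "FEA1" bytes. A result of `None` suggests the file is truncated or not a Feather file at all.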

As of yesterday, code to read the column names from Feather v1 and v2 files was added to this repo: https://github.com/aertslab/create_cisTarget_databases/commit/dcf70e60e915d2dc6850343960e7a7d3d3d56c41

so soon pySCENIC will be able to update to more recent versions of pyarrow. pySCENIC will also be patched so you don't need to rename columns anymore.

Here is the patch for pySCENIC to handle the databases created by create_cisTarget_databases automatically. https://github.com/aertslab/ctxcore/pull/1/commits/21a5f72532f4ef558e7d1a2ccb0177f06a9dda15

(For now, until pySCENIC makes use of this ctxcore package, you have to make this change yourself in the pySCENIC code base, in this file: https://github.com/aertslab/pySCENIC/blob/master/src/pyscenic/rnkdb.py)

yuyun-zhang commented 3 years ago

@tropfenameimer Thank you! I installed the latest version of RcisTarget (1.11.10), and I loaded the feather file motifs_vs_regions.rankings.feather successfully. But the file regions_vs_motifs.rankings.feather still cannot be loaded with either RcisTarget or arrow. I want to test with a subset of genes, but my server was busy today and my task has not been executed yet.

@ghuls Thank you too! The format of feather file output from create_cisTarget_databases.py is Feather v1:

>hexdump -C -n 8 genebodyupd3k.motifs_vs_regions.rankings.feather
00000000  46 45 41 31 00 00 00 00                           |FEA1....|
00000008

The format of renamed feather file output from arrow is Feather v2:

>hexdump -C -n 8 genebodyupd3k.motifs_vs_regions.rankings.rename.feather
00000000  41 52 52 4f 57 31 00 00                           |ARROW1..|
00000008

It works after changing the code of rnkdb.py, and now I can successfully run pyscenic ctx without changing the column name of the feather file.

Thank you two again!

ghuls commented 3 years ago

A new version of pySCENIC is out which uses the ctxcore package, so now there is no need anymore to patch the code yourself: https://github.com/aertslab/pySCENIC/releases/tag/0.11.2