aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

df2regulons splitting gene names by character #509

Closed Tripfantasy closed 11 months ago

Tripfantasy commented 11 months ago

Hello, I have produced a regulons dataframe from the following code:

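import pandas as pd

# Read the motif enrichment CSV, using the second header row as column names
reg = pd.read_csv(REGULONS_FNAME, sep=',', index_col=False, header=1)
reg = reg.rename(columns={'Unnamed: 0': 'TF', 'Unnamed: 1': 'MotifID'})
# Drop the leftover row below the header, then index by TF and motif
reg = reg.drop(0)
reg = reg.set_index(['TF', 'MotifID'])
reg.head()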

This matches the expected output from the tutorial. However, when I run df2regulons() before running aucell and inspect the resulting list, which (to my understanding) should contain each regulon's name, its genes and their weights, and other data from the initial dataframe, the gene names are clearly wrong:

Regulon(name='Alx3(+)', gene2weight=frozendict.frozendict({'[': 1.0, '(': 1.0, "'": 1.0, 'P': 1.0, 'l': 1.0, 's': 1.0, 'c': 1.0, 'r': 1.0, '4': 1.0, ',': 1.0, ' ': 1.0, '0': 1.0, '.': 1.0, '7': 1.0, '9': 1.0, '6': 1.0, '8': 1.0, '5': 1.0, '1': 1.0, '2': 1.0, '3': 1.0, ')': 1.0, 'C': 1.0, 'p': 1.0, 'a': 1.0, 'I': 1.0, 'n': 1.0, 'M': 1.0, 'e': 1.0, 't': 1.0, 'F': 1.0, 'o': 1.0, 'x': 1.0, 'd': 1.0, 'R': 1.0, 'D': 1.0, 'i': 1.0, 'L': 1.0, 'T': 1.0, 'b': 1.0, 'E': 1.0, 'h': 1.0, 'N': 1.0, 'g': 1.0, 'f': 1.0, 'm': 1.0, 'K': 1.0, ']': 1.0, 'v': 1.0, 'S': 1.0, 'A': 1.0, 'k': 1.0, 'O': 1.0, 'G': 1.0, 'j': 1.0, 'U': 1.0, 'J': 1.0, 'W': 1.0, 'V': 1.0, 'w': 1.0, 'B': 1.0, 'H': 1.0, 'u': 1.0, 'z': 1.0}), gene2occurrence=frozendict.frozendict({}), transcription_factor='Alx3', context=frozenset({'metacluster_9.26.png', 'activating'}), score=3.343534473214638, nes=0.0, orthologous_identity=0.0, similarity_qvalue=0.0, annotation='')

Issue #505 dealt with several fields in the dataframe being converted to string dtype. That does not seem to be the cause of this error, however:

print(reg.dtypes)

AUC                      float64
NES                      float64
MotifSimilarityQvalue    float64
OrthologousIdentity      float64
Annotation                object
Context                   object
TargetGenes               object
RankAtMax                float64
dtype: object

df2regulons() appears to be splitting the gene string by character. I believe this is why my aucell matrix contains only zeros, since it is reading incorrect gene names. Is there a workaround or solution for this? Thanks.
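A minimal sketch of how this symptom can arise, assuming the TargetGenes cell is still a plain string when the regulon is built (the dtype `object` covers both strings and lists, so `reg.dtypes` cannot tell them apart). The gene names and weights below are made up, and the dict comprehension only imitates the effect; it is not pySCENIC's actual code.

```python
# Hypothetical TargetGenes cell held as a plain string (names and weights are made up).
target_genes_cell = "[('GeneA', 0.8), ('GeneB', 0.5)]"

# Iterating over a str yields one character at a time, so every
# character ends up as a "gene" key with a default weight of 1.0.
gene2weight = {ch: 1.0 for ch in target_genes_cell}

print(sorted(gene2weight))
```

The keys are brackets, quotes, digits and single letters, the same pattern as in the Regulon output above.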

Tripfantasy commented 11 months ago

Fixed this by running:

from pyscenic.utils import load_motifs
df_motifs = load_motifs(REGULONS_FNAME)

I suspect the transform.py module expects a particular dataframe structure. Loading the file with load_motifs() seemed to produce a structure compatible with df2regulons(), and downstream this yields the expected aucell matrix. Yay!
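For anyone curious why a dedicated loader helps, here is a standard-library sketch of the round trip, under the assumption that the gene list is written to CSV as its text representation and must be parsed back on load. The column names follow the dataframe above; the row values and the `parse_target_genes` helper are hypothetical and only stand in for whatever load_motifs() does internally.

```python
import ast
import csv
import io

# Write one hypothetical row; the csv module stores the list as its str() form.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["TF", "MotifID", "TargetGenes"])
writer.writeheader()
writer.writerow({
    "TF": "Alx3",
    "MotifID": "motif_1",  # made-up motif ID
    "TargetGenes": [("GeneA", 0.8), ("GeneB", 0.5)],  # made-up genes and weights
})

# A plain read returns every cell as a string.
buf.seek(0)
raw = list(csv.DictReader(buf))

def parse_target_genes(row):
    """Hypothetical helper: turn the TargetGenes text back into a list of tuples."""
    parsed = dict(row)
    parsed["TargetGenes"] = ast.literal_eval(row["TargetGenes"])
    return parsed

loaded = [parse_target_genes(r) for r in raw]

print(type(raw[0]["TargetGenes"]).__name__)     # str
print(type(loaded[0]["TargetGenes"]).__name__)  # list
```

`ast.literal_eval` is used here because it only accepts Python literals; it is a choice made for this sketch, not a claim about the library's implementation.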