Closed orian116 closed 1 year ago
Can you show the first 10 lines of consensus_bed?
Hi, I am getting the same error. The first 10 lines of my consensus_bed file are:
GL000194.1 98334 98834 Astrocyte_peak_1 32.77411693033765 .
GL000194.1 114754 115254 Oligodendrocyte_peak_3 18.571321502075012 .
GL000194.1 104372 104872 Astrocyte_peak_2,Oligodendrocyte_peak_1,Oligodendrocyte_peak_2 47.233286164310144 .
GL000195.1 32228 32728 Astrocyte_peak_5 149.41141541771574 .
GL000195.1 32880 33380 Astrocyte_peak_6 47.233286164310144 .
GL000195.1 30600 31100 Oligodendrocyte_peak_4,Oligodendrocyte_peak_5,Astrocyte_peak_3a,Astrocyte_peak_3b,Astrocyte_peak_4,Microglia_PVM_peak_1,OPC_peak_1 315.2098893006003 .
GL000195.1 31144 31644 OPC_peak_1,Oligodendrocyte_peak_6 117.30246975018129 .
GL000205.2 1206 1706 Oligodendrocyte_peak_7 8.622399268820542 .
GL000205.2 9098 9598 Astrocyte_peak_7 32.77411693033765 .
GL000205.2 39110 39610 Astrocyte_peak_8 20.24283692756149 .
Is this issue solved? I'm having the same issue while running the tutorial.
No, I am still waiting for the reply
Let me know when you solve this.
Hi @Jinkyustar and @Citugulia40
I was not able to reproduce the error you are getting, so you will have to help me a bit with troubleshooting.
Can you run the following commands and show the output?
annot
set(annot["Strand"])
from pycisTopic.utils import read_fragments_from_file
fragments = read_fragments_from_file(fragments_dict["10x_pbmc"]) #you might have to replace the key in your case
fragments
annotation = annot
flank_window = 1000
tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + flank_window
tss_space_annotation["Start"] = tss_space_annotation["Start"] - flank_window
tss_space_annotation = tss_space_annotation[
["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation = pr.PyRanges(tss_space_annotation)
overlap_with_TSS = fragments.join(tss_space_annotation, nb_cpu=1).df
overlap_with_TSS
set(overlap_with_TSS["Strand"])
Hope to solve this soon.
Best,
Seppe
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str)
filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
#annot['Strand'] = annot['Strand'].replace({1: '+', -1: '-'})
annot["Strand"] = annot["Strand"].astype(np.int64)
This gives us a Strand column with the values -1 and 1.
annotation = annot
tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + 1000
tss_space_annotation["Start"] = tss_space_annotation["Start"] - 1000
tss_space_annotation = tss_space_annotation[
["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation
pr.pyranges_main.PyRanges(tss_space_annotation)
This then gives you NaN values in the Strand column
annot['Strand'] = annot['Strand'].replace({1: '+', -1: '-'})
#or
tss_space_annotation['Strand'] = tss_space_annotation['Strand'].replace({1: "+", -1: "-"})
Replacing 1 and -1 with +/- made it possible to compute "overlap_with_TSS":
tss_space_annotation_gr = pr.PyRanges(tss_space_annotation)
overlap_with_TSS = fragments.join(tss_space_annotation_gr, nb_cpu=1).df
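The strand conversion and the window construction shown above can be combined in one step. The following is only a sketch under assumptions: `build_tss_windows` is a hypothetical helper name (not part of pycisTopic), the toy table stands in for the real `annot`, and only pandas is used, so the `pr.PyRanges(...)` call still has to be applied to the result.

```python
import pandas as pd


def build_tss_windows(annot: pd.DataFrame, flank_window: int = 1000) -> pd.DataFrame:
    """Return Chromosome/Start/End/Strand windows around each TSS.

    Hypothetical helper: Strand is converted from 1/-1 to '+'/'-'.
    """
    # .copy() avoids assigning into a slice of the original annotation table
    tss = annot[["Chromosome", "Start", "Strand"]].copy()
    # End must be computed before Start is overwritten
    tss["End"] = tss["Start"] + flank_window
    tss["Start"] = tss["Start"] - flank_window
    tss["Strand"] = tss["Strand"].map({1: "+", -1: "-"})
    return tss[["Chromosome", "Start", "End", "Strand"]]


# Toy stand-in for the annotation table built from pybiomart
toy_annot = pd.DataFrame(
    {
        "Chromosome": ["chr1", "chr1"],
        "Start": [5000, 9000],
        "Strand": [1, -1],
        "Gene": ["A", "B"],
    }
)
windows = build_tss_windows(toy_annot)
print(windows)
```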
However, converting the strand to +/- then causes a problem in the "get_tss_matrix" function, because the Strand column is now categorical with +/- values:
def get_tss_matrix(fragments, flank_window, tss_space_annotation):
    """
    Get TSS matrix
    """
    overlap_with_TSS = fragments.join(tss_space_annotation, nb_cpu=1).df
    if len(overlap_with_TSS) == 0:
        return
    overlap_with_TSS["Strand"] = overlap_with_TSS["Strand"].astype(np.int32)
I have the same issue. annot['Strand'] = annot['Strand'].replace({1: '+', -1: '-'}) solved the NaN error, but then I get this one:
ValueError: Cannot cast object dtype to int32
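The cast fails because the Strand values are now the strings "+" and "-", which cannot be converted to int32 directly. One possible workaround is an explicit lookup instead of `astype`. This is a sketch only and not the change made in pycisTopic; `strand_to_int` is a hypothetical helper name.

```python
import pandas as pd


def strand_to_int(strand: pd.Series) -> pd.Series:
    """Map '+'/'-' (or 1/-1) strand values to int32 1/-1.

    Hypothetical helper; raises if an unexpected value is present.
    """
    lookup = {"+": 1, "-": -1, "1": 1, "-1": -1}
    # Going through str makes categorical, object and integer columns behave alike
    mapped = strand.astype(str).map(lookup)
    if mapped.isna().any():
        raise ValueError("Strand column contains values other than +, -, 1, -1")
    return mapped.astype("int32")


categorical_strand = pd.Series(["+", "-", "+"], dtype="category")
print(strand_to_int(categorical_strand).tolist())
```

Values such as NaN are rejected rather than silently cast, which would make the NaN Strand problem described in this thread visible at this step.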
Thank you @Citugulia40 , @Jinkyustar and @MariaRosariaNucera
I was able to reproduce the issue now. It is related to the pyranges version.
For versions above or equal to 0.0.128:
import pyranges as pr
pr.__version__
>>> '0.0.128'
import pybiomart as pbm
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str)
#filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
#annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
flank_window = 1000
annotation = annot
tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + flank_window
tss_space_annotation["Start"] = tss_space_annotation["Start"] - flank_window
tss_space_annotation = tss_space_annotation[
["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation = pr.PyRanges(tss_space_annotation)
tss_space_annotation
+--------------+-----------+-----------+--------------+
| Chromosome | Start | End | Strand |
| (category) | (int64) | (int64) | (category) |
|--------------+-----------+-----------+--------------|
| chr1 | 3068168 | 3070168 | nan |
| chr1 | 3068197 | 3070197 | nan |
| chr1 | 3068211 | 3070211 | nan |
| chr1 | 3068203 | 3070203 | nan |
| ... | ... | ... | ... |
| chrY | 57066898 | 57068898 | nan |
| chrY | 57206481 | 57208481 | nan |
| chrY | 57183216 | 57185216 | nan |
| chrY | 57183226 | 57185226 | nan |
+--------------+-----------+-----------+--------------+
For versions below 0.0.128:
import pyranges as pr
pr.__version__
>>> '0.0.127'
import pybiomart as pbm
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str)
#filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
#annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
flank_window = 1000
annotation = annot
tss_space_annotation = annotation[["Chromosome", "Start", "Strand"]]
tss_space_annotation["End"] = tss_space_annotation["Start"] + flank_window
tss_space_annotation["Start"] = tss_space_annotation["Start"] - flank_window
tss_space_annotation = tss_space_annotation[
["Chromosome", "Start", "End", "Strand"]
]
tss_space_annotation = pr.PyRanges(tss_space_annotation)
tss_space_annotation
+--------------+-----------+-----------+--------------+
| Chromosome | Start | End | Strand |
| (category) | (int64) | (int64) | (category) |
|--------------+-----------+-----------+--------------|
| chr1 | 3068168 | 3070168 | 1 |
| chr1 | 3068197 | 3070197 | 1 |
| chr1 | 3068211 | 3070211 | 1 |
| chr1 | 3068203 | 3070203 | 1 |
| ... | ... | ... | ... |
| chrY | 57066898 | 57068898 | 1 |
| chrY | 57206481 | 57208481 | 1 |
| chrY | 57183216 | 57185216 | 1 |
| chrY | 57183226 | 57185226 | 1 |
+--------------+-----------+-----------+--------------+
For now I have pinned pyranges to a version below 0.0.128; in the future we will have to update this code. https://github.com/aertslab/pycisTopic/commit/e563fb6647380e1d510bed38779adc2034ec6292
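To check whether an environment falls on the affected side of that version boundary, something like the following could be run first. It is a sketch that uses only the standard library; `parse_version` and `is_affected` are hypothetical helper names, and the 0.0.128 threshold is taken from the comparison above.

```python
from importlib.metadata import PackageNotFoundError, version

# First pyranges version reported in this thread to give NaN strands
FIRST_AFFECTED = (0, 0, 128)


def parse_version(text: str) -> tuple:
    """Turn '0.0.127' into (0, 0, 127); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def is_affected(text: str) -> bool:
    """True if the given version string is 0.0.128 or later."""
    return parse_version(text) >= FIRST_AFFECTED


try:
    installed = version("pyranges")
    state = "affected" if is_affected(installed) else "not affected"
    print(f"pyranges {installed}: {state}")
except PackageNotFoundError:
    print("pyranges is not installed in this environment")
```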
Did this solve your issue?
Best,
Seppe
Thank you @SeppeDeWinter , I confirm I also get 1 and -1 instead of the NaNs using pyranges version = '0.0.127'.
I am also able to solve this with pyranges version = '0.0.127'. Thank you so much
Thank you so much. I no longer get the error
Hi, I seem to have the same error and when I check my pyranges version, it is '0.0.127'.
Following the tutorial at https://scenicplus.readthedocs.io/en/latest/pbmc_multiome_tutorial.html, I am getting a data type conversion error at the QC step. I tried removing NaNs from the annot dataset, and the consensus_bed and fragments files don't appear to contain any NaNs.
The error is coming from:
The error seems to be coming from the step that creates the coverage matrix. This is the error output:
I'm running this in a Jupyter notebook after setting up my environment as follows:
Any help would be greatly appreciated