Issue running dataprep - Githubissues

acarmas1 commented 2 years ago

I've been trying to use xpore to identify m6A modifications in mRNA reads of honey bees.

I've already prepared my data using minimap2, samtools and nanopolish, but when I run dataprep using this code:

xpore dataprep \
--eventalign /project02/insect_multiomics/camila/xpore/Bee_Thorax/data/WT/nanopolish/eventalign.txt \
--gtf_or_gff /project02/insect_multiomics/camila/xpore/Bee_Thorax/GCF_003254395.2_Amel_HAv3.1_genomic.gff \
--transcript_fasta /project02/insect_multiomics/camila/xpore/Bee_Thorax/GCF_003254395.2_Amel_HAv3.1_cds_from_genomic.fna \
--out_dir dataprep \
--genome

I'm getting the error below and I don't know how to fix it. I'd appreciate any help.

/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
Traceback (most recent call last):
  File "/opt/anaconda3/2021.05/bin/xpore", line 10, in <module>
    sys.exit(main())
  File "/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/xpore.py", line 67, in main
    options.func(options)
  File "/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py", line 751, in dataprep
    annotation_dict,is_gff = readAnnotation(gtf_or_gff)
  File "/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py", line 221, in readAnnotation
    tx_id=ln[-1].split('transcript:')[1].split(';')[0]
IndexError: list index out of range

yuukiiwa commented 2 years ago

Hi @acarmas1,

Thank you for reporting the bug! Do you mind sharing your gff file with us, please? I think this is due to the incompatibility of the gff file with our annotation processing function. Thank you!

Best wishes, Yuk Kei
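To illustrate the incompatibility mentioned above, here is a minimal sketch of why the expression in the traceback can raise an IndexError, and of a more tolerant way to pull a transcript ID out of a GFF3 attribute column. It is not xpore's actual readAnnotation() code. The helper names are hypothetical, the two sample lines and their IDs are made up for illustration, and the "rna-" prefix for NCBI-style IDs is an assumption about the annotation file.

```python
def parse_gff_attributes(attr_field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in attr_field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcript_id_from_gff_line(line):
    """Return a transcript ID from one GFF3 line, or None if none is found."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete 9-column GFF3 feature line
    raw_id = parse_gff_attributes(fields[8]).get("ID", "")
    # Assumed conventions: "transcript:" (Ensembl style), "rna-" (NCBI style).
    for prefix in ("transcript:", "rna-"):
        if raw_id.startswith(prefix):
            return raw_id[len(prefix):]
    return None


# Made-up sample lines, for illustration only.
ensembl_line = "1\thavana\tmRNA\t100\t200\t.\t+\t.\tID=transcript:ENST00000456328;Parent=gene:ENSG00000223972"
ncbi_line = "NC_037638.1\tGnomon\tmRNA\t100\t200\t.\t+\t.\tID=rna-XM_001120000.5;Parent=gene-LOC726000"

print(transcript_id_from_gff_line(ensembl_line))
print(transcript_id_from_gff_line(ncbi_line))

# The expression from the traceback splits on 'transcript:' and takes element [1].
# When the attribute column lacks that substring, the split has a single element,
# so indexing [1] raises "IndexError: list index out of range".
print(len(ncbi_line.split("\t")[-1].split("transcript:")))
```

Returning None instead of indexing blindly lets the caller skip or report lines whose attribute format it does not recognise.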

acarmas1 commented 2 years ago

Hi @yuukiiwa

Thanks for answering, this is my gff file: https://1drv.ms/u/s!AokqkR3muxL0g5QuiTk0rORmIBAHdg?e=WjqL1P

It looks like this:

yuukiiwa commented 2 years ago

Hi @acarmas1,

Thank you for sharing! I will look into this and get back to you hopefully by Friday.

Best wishes, Yuk Kei

yuukiiwa commented 2 years ago

Hi @acarmas1,

I have updated the readAnnotation() function in the ncbi_honeybee_gff branch which runs with the gff file you provided. Do you mind installing xpore from the ncbi_honeybee_gff branch and testing whether it works for you?

git clone https://github.com/GoekeLab/xpore.git
cd xpore
git checkout origin/ncbi_honeybee_gff
sudo python3 setup.py install

Thank you!

Best wishes, Yuk Kei

acarmas1 commented 2 years ago

Hi Yuk,

I tried what you suggested and now I'm getting this error:

/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py:21: PerformanceWarning: indexing past lexsort depth may impact performance.
  pos_end += eventalign_result.loc[index]['line_length'].sum()
/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
Traceback (most recent call last):
  File "/opt/anaconda3/2021.05/bin/xpore", line 33, in <module>
    sys.exit(load_entry_point('xpore==2.1', 'console_scripts', 'xpore')())
  File "/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/xpore.py", line 67, in main
    options.func(options)
  File "/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py", line 761, in dataprep
    fasta_dict = readFasta(transcript_fasta,is_gff)
  File "/opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/scripts/dataprep.py", line 177, in readFasta
    g_id=info[1],split(".")[0]
NameError: name 'info' is not defined

yuukiiwa commented 2 years ago

Hi @acarmas1,

Do you mind sending me your fasta file, please? Or do you have a gtf file for honey bee instead? The gff option currently "works with GENCODE or ENSEMBL FASTA files". Thank you!

Best wishes, Yuk Kei

acarmas1 commented 2 years ago

Hi Yuk,

This is my fasta file: https://1drv.ms/u/s!AokqkR3muxL0g5Q6tbjcR5wEDrwLrg?e=kL5f6I and yes I also have this gtf file: https://1drv.ms/u/s!AokqkR3muxL0g5Q7thCGIpc3mwk4Rw?e=IIgcUp

Thanks, Camila

yuukiiwa commented 2 years ago

Hi Camila (I will tag you here @acarmas1),

Sorry for the delayed reply (it was Lunar New Year out here)! I updated xpore dataprep on the ncbi_honeybee branch (https://github.com/GoekeLab/xpore/tree/ncbi_honeybee), which now works with your provided fasta and gtf files from the previous comment. Thanks!

Best wishes, Yuk Kei

acarmas1 commented 2 years ago

Hi,

Thank you so much, it worked. I have one last question. I ran xpore dataprep with a different fasta file, which contains all the coding regions of the honey bee genome. Does it make a difference whether I pass this cds.fasta file to --transcript_fasta, or does the file have to be the reference genome?

yuukiiwa commented 2 years ago

Hi Camila (I will tag you here @acarmas1),

I am glad that the fix worked! We suggest running xpore dataprep --genome with a cDNA.fasta. Due to the formatting of the > lines of your cds.fasta file, xpore dataprep will not work with it. Thanks!

Best wishes, Yuk Kei

acarmas1 commented 2 years ago

Hi Yuk, I ran dataprep with the honey bee reference genome and it worked, but the resulting diffmod.table is empty, and I realized that my data.json file is empty too. I don't know if that means the honey bee transcriptome does not have the m6A modification or that something went wrong during the process.

yuukiiwa commented 2 years ago

Hi Camila,

Can you share a screenshot of the first few lines of your eventalign.txt, please? Also, did you use the same cDNA.fasta file for nanopolish eventalign? Thanks!

Best wishes, Yuk Kei
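The question above can also be checked without a screenshot. The following is a rough sketch that compares the reference names in the contig column of eventalign.txt (the first column of nanopolish eventalign output) with the record names in a FASTA file. The function names and the placeholder paths are hypothetical, and the sketch assumes the FASTA record name is the first whitespace-delimited token after ">".

```python
import csv


def fasta_record_ids(fasta_path):
    """Collect the first whitespace-delimited token of every '>' header line."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def eventalign_contigs(eventalign_path, max_rows=100_000):
    """Collect reference names from the 'contig' column of the first max_rows rows."""
    contigs = set()
    with open(eventalign_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row_number, row in enumerate(reader):
            if row_number >= max_rows:
                break
            contigs.add(row["contig"])
    return contigs


def contigs_missing_from_fasta(fasta_path, eventalign_path):
    """Return contig names seen in eventalign.txt that have no FASTA record."""
    return eventalign_contigs(eventalign_path) - fasta_record_ids(fasta_path)


# Example call (both paths are placeholders):
# print(contigs_missing_from_fasta("reference.fasta", "eventalign.txt"))
```

A non-empty result would suggest that the events were aligned against a different reference than the one being passed to dataprep. An empty result does not rule out other causes of an empty data.json.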

acarmas1 commented 2 years ago

Hi Yuk,

This is my eventalign.txt file, and yes, I used the same fasta file that I ran nanopolish eventalign with.

acarmas1 commented 2 years ago

After running xpore diffmod I got this message:

Using the signal of unmodified RNA from /opt/anaconda3/2021.05/lib/python3.8/site-packages/xpore/diffmod/model_kmer.csv
0 ids to be testing ...

And the diffmod.table is empty.

Shruti-BioCode commented 1 year ago

I am running dataprep with the following command and am running into multiple issues with it:

xpore dataprep --eventalign eventalign.txt --out_dir 04_dataprep --n_processes 30 --readcount_min 5

For one sample it ran fine, but for others I get different issues.

For one sample I get the error below, but it still proceeds:

/hpcnfs/data/cgb/conda_envs/xpore2.0/bin/xpore:33: DtypeWarning: Columns (7) have mixed types.Specify dtype option on import or set low_memory=False.
  sys.exit(load_entry_point('xpore==2.0', 'console_scripts', 'xpore')())
Traceback (most recent call last):
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/bin/xpore", line 33, in <module>
    sys.exit(load_entry_point('xpore==2.0', 'console_scripts', 'xpore')())
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/lib/python3.7/site-packages/xpore-2.0-py3.7.egg/xpore/scripts/xpore.py", line 67, in main
    options.func(options)
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/lib/python3.7/site-packages/xpore-2.0-py3.7.egg/xpore/scripts/dataprep.py", line 684, in dataprep
    parallel_index(eventalign_filepath,chunk_size,out_dir,n_processes,resume)
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/lib/python3.7/site-packages/xpore-2.0-py3.7.egg/xpore/scripts/dataprep.py", line 59, in parallel_index
    for chunk in pd.read_csv(eventalign_filepath, chunksize=chunk_size,sep='\t'):
  File "/hpcnfs/software/anaconda/anaconda3/envs/env_p37/lib/python3.7/site-packages/pandas/io/parsers.py", line 1107, in __next__
    return self.get_chunk()
  File "/hpcnfs/software/anaconda/anaconda3/envs/env_p37/lib/python3.7/site-packages/pandas/io/parsers.py", line 1167, in get_chunk
    return self.read(nrows=size)
  File "/hpcnfs/software/anaconda/anaconda3/envs/env_p37/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/hpcnfs/software/anaconda/anaconda3/envs/env_p37/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 886, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 15 fields in line 3080042597, saw 19

The other is:

gg/xpore/scripts/dataprep.py:72: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chunk_split['line_length'] = np.array(lines)
Traceback (most recent call last):
  File "/hpcnfs/software/anaconda/anaconda3/envs/env_p37/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'transcript_id'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/bin/xpore", line 33, in <module>
    sys.exit(load_entry_point('xpore==2.0', 'console_scripts', 'xpore')())
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/lib/python3.7/site-packages/xpore-2.0-py3.7.egg/xpore/scripts/xpore.py", line 67, in main
    options.func(options)
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/lib/python3.7/site-packages/xpore-2.0-py3.7.egg/xpore/scripts/dataprep.py", line 695, in dataprep
    parallel_preprocess_tx(eventalign_filepath,out_dir,n_processes,readcount_min,readcount_max,resume)
  File "/hpcnfs/data/cgb/conda_envs/xpore2.0/lib/python3.7/site-packages/xpore-2.0-py3.7.egg/xpore/scripts/dataprep.py", line 458, in parallel_preprocess_tx
    tx_ids_done = list(df_index['transcript_id'].unique())
  File "/hpcnfs/software/anaconda/anaconda3/envs/env_p37/lib/python3.7/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/hpcnfs/software/anaconda/anaconda3/envs/env_p37/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1618, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1626, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'transcript_id'
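The ParserError reported above ("Expected 15 fields ..., saw 19") means that at least one line of eventalign.txt has more tab-separated fields than the header. A possible way to locate such lines is sketched below. The function name is hypothetical, the sketch streams the file line by line rather than loading it, and its line numbers are 1-based counts of physical lines, which may not match pandas' numbering exactly.

```python
def find_lines_with_wrong_field_count(path, sep="\t", limit=10):
    """Compare every line's field count with the header's.

    Returns (expected_fields, bad), where bad is a list of
    (line_number, n_fields) tuples, stopping after `limit` hits.
    """
    bad = []
    with open(path) as handle:
        header = handle.readline()
        expected = len(header.rstrip("\n").split(sep))
        for line_number, line in enumerate(handle, start=2):
            stripped = line.rstrip("\n")
            if stripped == "":
                continue  # ignore blank lines
            n_fields = len(stripped.split(sep))
            if n_fields != expected:
                bad.append((line_number, n_fields))
                if len(bad) >= limit:
                    break
    return expected, bad


# Example call (the path is a placeholder):
# expected, bad = find_lines_with_wrong_field_count("eventalign.txt")
# print(expected, bad)
```

Lines with extra fields might point to a truncated or interleaved write when eventalign.txt was produced, in which case regenerating the file for that sample may be simpler than repairing it. That is a guess, not something confirmed in this thread.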

GoekeLab / xpore

Issue running dataprep #123
