GWW / scsnv

scSNV Mapping tool for 10X Single Cell Data
MIT License
22 stars 4 forks source link

Can't read Gene Groups(scsnv_index_genes.txt.gz) #5

Closed ttab963 closed 3 years ago

ttab963 commented 3 years ago

Hello,

I'm trying to scsnv map with 10x 5' V2 data, but it can't read scsnv_index_genes.txt.gz file. This is Run commend & error messages.

scsnv map -l V2 -i scsnv_index -g /data/users/sjp/Projects/scLUPUS/scsnv_index/scsnv_index_bwa.bwt \ -b /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/barcode -t 24 --bam-write 4 -q 4 \ -o /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2 \ -c /data/users/sjp/Projects/scLUPUS/scsnv_index/scsnv_index_genes.txt.gz \ /data/rawData/scLUPUS/fastq/OM-HL-007-UC2

#[00:00:00] Loaded 222696 known barcodes from /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/barcode_counts.txt.gz #Could not open scsnv_index_genes.txt.gz for reading

Is this scsnv_index_genes.txt right? It has proper information(ref , tid, gid, gene_id, gene_name....).

Thanks, Paek.

GWW commented 3 years ago

Hi Paek,

I think there are a few paths that are incorrect. The -i option needs to be a full path to scsnv_index folder (that you indexed with scsnv index). The other issue I see is that the -g option should be to an index of the whole genome you used to generate the scsnv_index, ie. bwa index /path/to/genome.fa and the -g option would then be /path/to/genome.fa

Finally the -c option is meant to be a basic text file that is used to quantify specific groups of genes such as mitochondrial genes. It is entirely optional and you can leave it out. An example would be something like:

gene_id<tab>group
ENSG00000211459 MT
scsnv map -l V2 -i /data/users/sjp/Projects/scLUPUS/scsnv_index/scsnv_index -g /path/to/genome.fa -b /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/barcode -t 24 --bam-write 4 -q 4 -o /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2  /data/rawData/scLUPUS/fastq/OM-HL-007-UC2
ttab963 commented 3 years ago

Thanks for your kind answer, but another error comes again. I change command like this,

scsnv map -l V2 -i /data/users/sjp/Projects/scLUPUS/scsnv_index/scsnv_index -g /data/ref/tenX/refdata-gex-GRCh38-2020-A/fasta/genome.fa \ -b /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/barcode -t 24 --bam-write 4 -q 4 \ -o /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2 \ /data/rawData/scLUPUS/fastq/OM-HL-007-UC2

But somethings wrong. [M::bwa_idx_load_from_disk] read 0 ALT contigs [E::bwa_idx_load_from_disk] fail to locate the index files [00:00:00] Loaded 222696 known barcodes from /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/barcode_counts.txt.gz Index load failed`

I'm totally not friendly with those bwa files, but i think there are no problem(.sa, .amb, .ann, .pac, .bwt).

Thanks again, Paek

GWW commented 3 years ago

It seems like the tool can't find one of the BWA indices. I also noticed that you mentioned you are using a 5p library. For most of the commands you need to specify the library type with -l V3_5P or -l V2_5p depending on which library you are using.

Can you verify the contents of the folder `/data/users/sjp/Projects/scLUPUS/scsnv_index/

It should have these files:

scsnv_index_bwa.amb
scsnv_index_bwa.ann
scsnv_index_bwa.bwt
scsnv_index_bwa.pac
scsnv_index_bwa.sa
scsnv_index_genes.txt.gz
scsnv_index_lenghts.txt
scsnv_index_transcripts.fa.gz
scsnv_index_transcripts.txt.gz

The other folder /data/ref/tenX/refdata-gex-GRCh38-2020-A/fasta/ should have:

genome.fa
genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.pac
genome.fa.sa

Edit:

I have updated the documentation to reflect the library settings and gene_groups file.

ttab963 commented 3 years ago

It works so good! Thanks again for your response!!

ttab963 commented 3 years ago

Hello again.

Sorry for reopen, but scsnvmisc cells makes another problems. I made summary.h5 file and it contained like this.

HDF5 "/data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/OM-HL-007-UC2.summary.h5" {
GROUP "/" {
   DATASET "barcode_ids" {
      DATATYPE  H5T_STD_U32LE
      DATASPACE  SIMPLE { ( 222696 ) / ( 222696 ) }
      DATA {
      (0): 177804, 142753, 6495, 14469, 104697, 44792, 14336, 50210, 151124,
      (9): 150376, 207310, 94455, 57673, 183673, 48768, 125208, 26413, 89443,
.....

and run command is

scsnvmisc cells -o /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/ /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/OM-HL-007-UC2.summary.h5

but error message came like this.

Traceback (most recent call last):
  File "/data/anaconda3/bin/scsnvmisc", line 4, in <module>
    __import__('pkg_resources').run_script('scsnvpy==1.0', 'scsnvmisc')
  File "/data/anaconda3/lib/python3.8/site-packages/pkg_resources/__init__.py", line 665, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/data/anaconda3/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/data/anaconda3/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/EGG-INFO/scripts/scsnvmisc", line 37, in <module>
    cmds.COMMANDS[cmd](args)
  File "/data/anaconda3/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/cells.py", line 201, in cells_cmd
    krates = brates[brates.molecules >= 1].copy()
  File "/data/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 5478, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'molecules'

I'm sorry to ask about too minor things, but this tool is so fascinating to give up.

Thanks.

GWW commented 3 years ago

No need to apologize. I want to make sure the tool works for everyone who wishes to use it. It's a strange error as that field should always be available. Can you run the command below and let me know if you get a similar output as to what I have included below?

h5ls /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/OM-HL-007-UC2.summary.h5/barcode_rates

When I try it on a file I have processed I see:

align_ambiguous          Dataset {521359}
align_antisense          Dataset {521359}
align_barcode_corrected  Dataset {521359}
align_cdna               Dataset {521359}
align_intergenic         Dataset {521359}
align_intronic           Dataset {521359}
align_multimapped        Dataset {521359}
align_tag_qa_fail        Dataset {521359}
align_total_reads        Dataset {521359}
align_umi_qa_fail        Dataset {521359}
align_unmapped           Dataset {521359}
discarded_reads          Dataset {521359}
field_order              Dataset {23}
genes                    Dataset {521359}
group_MT                 Dataset {521359}
intron_molecules         Dataset {521359}
intron_pcr_dups          Dataset {521359}
intron_reads             Dataset {521359}
molecules                Dataset {521359}
pcr_dups                 Dataset {521359}
reads                    Dataset {521359}
ttab963 commented 3 years ago

Thanks, but it looks similar with yours. 'h5ls /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/OM-HL-007-UC2.summary.h5/' is this.

barcode_ids              Dataset {222696}
barcode_rates            Group
barcodes                 Dataset {222696}
exonic                   Group
gene_ids                 Dataset {23713}
gene_names               Dataset {23713}
intronic                 Group

and 'h5ls /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/OM-HL-007-UC2.summary.h5/barcode_rates' is this.

align_ambiguous          Dataset {222696}
align_antisense          Dataset {222696}
align_barcode_corrected  Dataset {222696}
align_cdna               Dataset {222696}
align_intergenic         Dataset {222696}
align_intronic           Dataset {222696}
align_multimapped        Dataset {222696}
align_tag_qa_fail        Dataset {222696}
align_total_reads        Dataset {222696}
align_umi_qa_fail        Dataset {222696}
align_unmapped           Dataset {222696}
discarded_reads          Dataset {222696}
field_order              Dataset {19}
genes                    Dataset {222696}
intron_molecules         Dataset {222696}
intron_pcr_dups          Dataset {222696}
intron_reads             Dataset {222696}
molecules                Dataset {222696}
pcr_dups                 Dataset {222696}
reads                    Dataset {222696}

I check this file in python, and barcode_rates/molecules has data like [46993 39395 36750 ... 0 0 0]. Could it be a HDF5 C and C++ libraries problem?

GWW commented 3 years ago

Thanks for your patience and help figuring out this issue.

I suspect it may be a string encoding issue with python. Pandas has issues with columns named after byte strings and I think on your system the strings are bytes. I pushed a few changes to the code to resolve this. If you can clone the repository and compile the cmake and python code again, it should hopefully resolve the issue.

ttab963 commented 3 years ago

Sorry for late. I had to change my envs for other reasons, and recompile your code. And unfortunately, there's a another issue but it seems to get problem with HDF5.

cmake -DCMAKE_BUILD_TYPE=Release .. This command goes wrong.

-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /data/anaconda3/bin/x86_64-conda-linux-gnu-cc
-- Check for working C compiler: /data/anaconda3/bin/x86_64-conda-linux-gnu-cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /data/anaconda3/bin/x86_64-conda-linux-gnu-c++
-- Check for working CXX compiler: /data/anaconda3/bin/x86_64-conda-linux-gnu-c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- HDF5: Using hdf5 compiler wrapper to determine C configuration
-- HDF5: Using hdf5 compiler wrapper to determine CXX configuration
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find HDF5 (missing: HDF5_HL_LIBRARIES) (found version "1.10.6")
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindHDF5.cmake:929 (find_package_handle_standard_args)
  CMakeLists.txt:14 (find_package)

-- Configuring incomplete, errors occurred!
See also "/data/users/sjp/programs/scsnv/build/CMakeFiles/CMakeOutput.log".
See also "/data/users/sjp/programs/scsnv/build/CMakeFiles/CMakeError.log".
(base) sjp@qlab:/data/users/sjp/programs/scsnv/build$ make
make: *** No targets specified and no makefile found.  Stop.

I used HDF5 in conda this time. Should I have to use other ways? this cmake way is so difficult to me ...

Thanks again anyway.

GWW commented 3 years ago

There seems to be some issues when trying to compile in a conda environment. I will try to find a work around and let you know.

GWW commented 3 years ago

I have cleaned up the build process and have successfully compiled the main program and the python library in a conda environment. I would delete the scsnv directory and clone it again.

I created the environment with the following command:

conda create -c conda-forge -c bioconda -n SCSNV_TEST gcc_linux-64 gxx_linux-64 cmake hdf5 zlib 

Hopefully this resolves your build issues.

ttab963 commented 3 years ago

Okay, compiling is fine and error that i mention before is passed. but I realize my summary.h5 file doesn't have "group_MT" and another error comes out actually. My command is, scsnvmisc cells -o /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/ \ /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/OM-HL-007-UC2.summary.h5'

/data/anaconda3/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/cells.py:237: MatplotlibDeprecationWarning: The 'subsx' parameter of __init__() has been renamed 'subs' since Matplotlib 3.3; support for the old name will be dropped two minor releases later.
  ax.set_xscale('log', subsx=[2,3,4,5,6,7,8,9])
Traceback (most recent call last):
  File "/data/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'group_MT'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/anaconda3/bin/scsnvmisc", line 4, in <module>
    __import__('pkg_resources').run_script('scsnvpy==1.0', 'scsnvmisc')
  File "/data/anaconda3/lib/python3.8/site-packages/pkg_resources/__init__.py", line 651, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/data/anaconda3/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1448, in run_script
    exec(code, namespace, namespace)
  File "/data/anaconda3/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/EGG-INFO/scripts/scsnvmisc", line 37, in <module>
    cmds.COMMANDS[cmd](args)
  File "/data/anaconda3/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/cells.py", line 244, in cells_cmd
    MT_cutoff(axs[1], prates, args.mt_key, args.mt_mad, args.mt_max, args.mt_perc)
  File "/data/anaconda3/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/cells.py", line 99, in MT_cutoff
    pmt = 100.0 * df[key].values / mols
  File "/data/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py", line 3455, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/data/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'group_MT'

There is no error message in previous steps. If this error is because mitocondrial reads were not counted, which point do i wrong? My reference is h38 from 10x.

I'm sorry to bother you this long. Thanks again.

GWW commented 3 years ago

I have added an additional flag to the cell command to skip filtering by MT(%) if you did not specify a groups file with that information.

You can run it with

scsnvmisc cells --skip-mt -o /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/ /data/users/sjp/Projects/scLUPUS/scsnv_count/OM-HL-007-UC2/OM-HL-007-UC2.summary.h5'

Hopefully that resolves that issue.

ttab963 commented 3 years ago

Finally It works well from beginning to end! Thank you, GWW.

GWW commented 3 years ago

Great I am so glad it has all worked out. Thanks for your patience with all of the issues.