diogomribeiro / sc_cop

Single cell local gene co-expression project
MIT License

How could I generate the standard `info` column in `expression_matrix.bed`? #3

Open jiangpuxuan opened 1 year ago

jiangpuxuan commented 1 year ago

Here is the standard format of expression_matrix.bed in CODer.py:

 Example format:
            #chr    start   end     gene    info    strand  sample1  sample2
            GL000192.1      495564  495565  ENSG00000277655.1_5     L=451;T=unprocessed_pseudogene;R=GL000192.1:493155-495565;N=AC245407.1  -       0.4       0
            GL000193.1      81322   81323   ENSG00000280081.3_5     L=2485;T=lincRNA;R=GL000193.1:49232-81323;N=LINC01667   -       0       1.34

        Note: there is an alternative format in which the header labels are as follows:
            #chr    start   end     id      gid     strd    sample1    sample2

Is the info column strictly required? Is the gid column of the "alternative format" equivalent to info? How can I generate the standard-format 'expression.mtx' from my 'filtered_feature_bc_matrix' and '.gtf'?

jiangpuxuan commented 1 year ago

There is something wrong with my info column:

START STEP : 'Processing phenotype coordinates..'
Error during execution of CODer.py. Aborting :
list index out of range
---------------
Traceback (most recent call last):
  File "/Array-Tuqiang/pxjiang/packages/sc_cop-main/CODer.py", line 1248, in <module>
    run.run()
  File "/Array-Tuqiang/pxjiang/packages/sc_cop-main/CODer.py", line 1146, in run
    self.read_real_coordinates()
  File "/Array-Tuqiang/pxjiang/packages/sc_cop-main/CODer.py", line 335, in read_real_coordinates
    region = info.split(";")[2].split(":")[1]

The code region = info.split(";")[2].split(":")[1] appears to extract the region information, ;R=GL000192.1:493155-495565;, from the info column in the example format.

I noticed that ENSG00000277655.1_5 spans 495564 to 495565 in the start and end columns, which differs from R=GL000192.1:493155-495565. Why?


So how should I generate the info column? Would it be enough to write only the region information into the info column, e.g. R=GL000192.1:493155-495565? Thank you for your help!
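
For reference, here is a minimal sketch of what the expression from the traceback does to an info string. The helper name `region_from_info` is made up for illustration and is not part of CODer.py; only the one-line expression is taken from the traceback above. It suggests that an info value containing only the R= field would still fail, because the expression indexes the third `;`-separated field.

```python
def region_from_info(info):
    """Apply the expression from the traceback: take the third ';'-separated
    field, then the text after its first ':'."""
    return info.split(";")[2].split(":")[1]


# Info string copied from the example format: R= is the third field.
good = "L=451;T=unprocessed_pseudogene;R=GL000192.1:493155-495565;N=AC245407.1"
print(region_from_info(good))

# Info string holding only the region: there is no third field to index.
bad = "R=GL000192.1:493155-495565"
try:
    region_from_info(bad)
except IndexError as exc:
    print("IndexError:", exc)
```

Under this reading, "list index out of range" appears whenever the info value has fewer than three `;`-separated fields or its third field contains no `:`.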

diogomribeiro commented 1 year ago

Hi, you can have a look at the --determineTSS flag. Basically, you can either give the TSS position in the "start" column or have it calculated from the info field. The info field is not strictly mandatory; it depends on how you process your files.
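
As a rough illustration of the first option, the sketch below builds the six leading columns of one row from a GTF gene record, putting the TSS in the start/end columns and filling info in the L=;T=;R=;N= layout of the example. Several things here are assumptions rather than documented behaviour of CODer.py: the function names are hypothetical, the GTF line is reconstructed from the example row (its source column is invented), the attribute keys `gene_type`/`gene_biotype`/`gene_name` depend on the annotation, and the plus-strand rule is inferred by symmetry from the two minus-strand example rows. The L value is passed in by the caller because in the example it is not the gene span (451 versus a region of 2411 bases).

```python
import re


def parse_gtf_attributes(text):
    """Turn 'key "value"; key "value";' into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)"', text))


def gene_to_bed_fields(gtf_line, length="NA"):
    """Return [chr, start, end, gene, info, strand] for one GTF 'gene' record."""
    (chrom, _source, feature, start, end,
     _score, strand, _frame, attrs) = gtf_line.rstrip("\n").split("\t")
    if feature != "gene":
        raise ValueError("expected a 'gene' record")
    start, end = int(start), int(end)
    a = parse_gtf_attributes(attrs)

    # Assumption: TSS is the gene end on '-' and the gene start on '+'.
    tss = end if strand == "-" else start

    # Attribute key for the gene type differs between annotations.
    gene_type = a.get("gene_type", a.get("gene_biotype", "NA"))
    info = (f"L={length};T={gene_type};"
            f"R={chrom}:{start}-{end};N={a.get('gene_name', 'NA')}")

    # One-base interval ending at the TSS, as in the example rows.
    return [chrom, str(tss - 1), str(tss), a["gene_id"], info, strand]


# Hypothetical GTF record reconstructed from the first example row.
line = "\t".join([
    "GL000192.1", "HAVANA", "gene", "493155", "495565", ".", "-", ".",
    'gene_id "ENSG00000277655.1_5"; gene_type "unprocessed_pseudogene"; '
    'gene_name "AC245407.1";',
])
print("\t".join(gene_to_bed_fields(line, length=451)))
```

This would also explain why the start and end columns (495564, 495565) differ from the R= region: for a minus-strand gene they mark only the last base of the region. The per-sample expression values would still need to be appended after the strand column.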