cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License
33 stars 42 forks source link

Bug in extracting sample id from script aggregate_rsem_results.py #524

Open mw201608 opened 1 year ago

mw201608 commented 1 year ago

In the container The rsem aggregation step failed to extract the proper sample ids when sample names contained character ., and may even crashed (error message shown below) when the partially extracted sample ids are not unique.

Traceback (most recent call last):
  File "/opt/src/aggregate_rsem_results.py", line 90, in <module>
    df_dict[c].to_hdf(tmp_store.name, '{0:s}{1:d}'.format(c,k))
  File "/hpc/users/wangm05/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 2490, in to_hdf
    pytables.to_hdf(
  File "/hpc/users/wangm05/.local/lib/python3.8/site-packages/pandas/io/pytables.py", line 282, in to_hdf
    f(store)
  File "/hpc/users/wangm05/.local/lib/python3.8/site-packages/pandas/io/pytables.py", line 265, in <lambda>
    f = lambda store: store.put(
  File "/hpc/users/wangm05/.local/lib/python3.8/site-packages/pandas/io/pytables.py", line 1030, in put
    self._write_to_group(
  File "/hpc/users/wangm05/.local/lib/python3.8/site-packages/pandas/io/pytables.py", line 1697, in _write_to_group
    s.write(
  File "/hpc/users/wangm05/.local/lib/python3.8/site-packages/pandas/io/pytables.py", line 3093, in write
    raise ValueError("Columns index has to be unique for fixed format")
ValueError: Columns index has to be unique for fixed format

This is because the sample ids are extracted through string split by ., sample_id = filename.split('.')[0]

It might be better to use a regular expression to extract the complete sample ids, like re.sub(".rsem.[a-z]+.results", "", filename)

Minghui

gaow commented 1 year ago

Thanks @mw201608 ! @hsun3163 please correct me if I'm wrong but that files was from GTEx repository right? I wonder if we should just change it and maintain a local copy of it along the lines @mw201608 suggested or if you can check they have an update? I say we start changing it locally and then we send PR on their repo as we see fit?