bvaldebenitom / SoloTE

GNU General Public License v3.0

ERROR: Requested column 3, but database file - only has fields 1 - 0. #17

Closed s2hui closed 3 weeks ago

s2hui commented 1 year ago

Hello,

I am running into an error similar to the one documented in #14. I opened a separate issue because the thread in #14 is still ongoing and I didn't want to make it more confusing:

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.

*****
***** ERROR: Requested column 3, but database file - only has fields 1 - 0.
['sample_countpercell_1.counts', 'sample_countpercell_8.counts', 'sample_countpercell_9.counts', 'sample_countpercell_MT.counts', 'sample_countpercell_X.counts', 'sample_countpercell_Y.counts', 'sample_countpercell_KI270728.1.counts', 'sample_countpercell_KI270727.1.counts', 'sample_countpercell_GL000009.2.counts', 'sample_countpercell_GL000194.1.counts', 'sample_countpercell_GL000205.2.counts', 'sample_countpercell_GL000195.1.counts', 'sample_countpercell_GL000219.1.counts', 'sample_countpercell_KI270734.1.counts', 'sample_countpercell_GL000218.1.counts', 'sample_countpercell_KI270721.1.counts', 'sample_countpercell_KI270726.1.counts', 'sample_countpercell_KI270711.1.counts', 'sample_countpercell_10.counts', 'sample_countpercell_11.counts', 'sample_countpercell_12.counts', 'sample_countpercell_13.counts', 'sample_countpercell_14.counts', 'sample_countpercell_15.counts', 'sample_countpercell_16.counts', 'sample_countpercell_17.counts', 'sample_countpercell_18.counts', 'sample_countpercell_19.counts', 'sample_countpercell_2.counts', 'sample_countpercell_20.counts', 'sample_countpercell_21.counts', 'sample_countpercell_22.counts', 'sample_countpercell_3.counts', 'sample_countpercell_4.counts', 'sample_countpercell_5.counts', 'sample_countpercell_6.counts', 'sample_countpercell_7.counts']
sample_allcounts.txt exists. Will be removed
Creating final results directory
/cluster/projects/group/herv/out/sample/sample_SoloTE_output was created
Traceback (most recent call last):
  File "SoloTE_developmental_20230503.py", line 342, in <module>
    file_table = pd.read_table(input_te_file,header=None,sep="\t")
  File "/cluster/home/s2hui/.local/share/r-miniconda/lib/python3.8/site-packages/pandas/util/_decorators.py", line 211, in wrapper
    return func(*args, **kwargs)
  File "/cluster/home/s2hui/.local/share/r-miniconda/lib/python3.8/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/cluster/home/s2hui/.local/share/r-miniconda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1289, in read_table
    return _read(filepath_or_buffer, kwds)
  File "/cluster/home/s2hui/.local/share/r-miniconda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 605, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/cluster/home/s2hui/.local/share/r-miniconda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/cluster/home/s2hui/.local/share/r-miniconda/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1753, in _make_engine
    return mapping[engine](f, **self.options)
  File "/cluster/home/s2hui/.local/share/r-miniconda/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 79, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

I'm using the versions of dependencies below:

I followed the issue thread in #14 and ran the developmental version of the pipeline script (SoloTE_developmental_20230503.tar.gz).

An output directory was created with barcode, features and matrix files, but the run still produced the errors above.

-rw-r--r-- 1 s2hui group 4.1M May 12 09:12 barcodes.tsv
-rw-r--r-- 1 s2hui group 359K May 12 09:12 features.tsv
-rw-r--r-- 1 s2hui group   73 May 12 09:12 matrix.mtx

I also noticed the following files in the temp directory are empty:

-rw-r--r-- 1 s2hui group 0 May 12 08:46 sample_locustes_2.txt
-rw-r--r-- 1 s2hui group 0 May 12 08:44 sample_locustes.txt
-rw-r--r-- 1 s2hui group 0 May 11 22:15 sample_nogenes_overlappingtes.bed
-rw-r--r-- 1 s2hui group 0 May 11 22:15 sample_selectedtes.bed
-rw-r--r-- 1 s2hui group 0 May 12 08:46 sample_subftes_2.txt
-rw-r--r-- 1 s2hui group 0 May 12 08:44 sample_subftes.txt
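Zero-byte intermediate files like these are a quick signal that an upstream filtering step matched nothing. A minimal sketch for spotting them, assuming only that the temp directory is a regular folder on disk (the helper name `find_empty_files` is hypothetical and not part of SoloTE):

```python
from pathlib import Path


def find_empty_files(directory):
    """Return the sorted names of zero-byte regular files directly inside `directory`."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_size == 0
    )
```

Calling `find_empty_files` on the temp directory before re-running would list files such as `sample_selectedtes.bed` if they are still empty.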

I wonder if I also have an issue with my TE file? I only have 4 fields in my file as I don't have score values.

chr1    1412252 1416234 chr1|1412252|1416234|Harlequin-int~LTR2B:LTR:ERV1|-
chr1    1417491 1418852 chr1|1417491|1418852|Harlequin-int~LTR2B:LTR:ERV1|-
chr1    3801472 3803129 chr1|3801472|3803129|HERVK13-int~LTR13:LTR:ERVK|-
chr1    3803436 3803669 chr1|3803436|3803669|HERVK13-int:LTR:ERVK|-
chr1    3805225 3805922 chr1|3805225|3805922|HERVK13-int:LTR:ERVK|-

Thanks a lot for your help, @s2hui

bvaldebenitom commented 1 year ago

Hi @s2hui,

can you share the first few lines of your BAM file?

s2hui commented 1 year ago

Hello,

Here are the first 5 lines. Please let me know if there is another more appropriate command to run. Thank you!

$ samtools view possorted_genome_bam.bam | head -n 5
NB552139:27:HKCYMBGXB:1:11112:4266:17168    272 1   10546   1   56M *   0   0   ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTG    AAEEEEEEEEA<EEEEEEEEEEEEEEEEEEEEEEEE/AEAEEAEEAE6EEEAAAAA    NH:i:3  HI:i:3  AS:i:53 nM:i:1  RE:A:I  li:i:0  BC:Z:TGACGCCC   QT:Z:AAAAAEEE   CR:Z:GCATTAGCATACTGTG   CY:Z:AAAAAEEEEEEEEEEE   CB:Z:GCATTAGCATACTGTG-1 UR:Z:ATTTTAGTGGGC   UY:Z:EEEEEEEEEEEE   UB:Z:ATTTTAGTGGGC   RG:Z:counts_out_sample:0:1:HKCYMBGXB:1
NB552139:27:HKCYMBGXB:1:11306:12412:13291   272 1   10546   1   56M *   0   0   ATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTG    EEEEE/EEEEEEAEEEEEEEEEE/EEEEEEEEEEEEEEE/EEEEEEEEEEEAAAAA    NH:i:3  HI:i:3  AS:i:53 nM:i:1  RE:A:I  li:i:0  BC:Z:TGACGCCC   QT:Z:AAAAAEEE   CR:Z:GCATTAGCATACTGTG   CY:Z:AAAAAEEEEEEEEEEE   CB:Z:GCATTAGCATACTGTG-1 UR:Z:ATTTTAGTGGGC   UY:Z:EEEEEEEEEEEE   UB:Z:ATTTTAGTGGGC   RG:Z:counts_out_sample:0:1:HKCYMBGXB:1
NB552139:27:HKCYMBGXB:1:23203:8404:3397 256 1   11279   0   1S55M   *   0   0   CGCCAGCGCCCCCTGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATA    AAAAAEEEEEEEEEEEEEEEEEEEEEEE/EEEEAEEEEAAEEAEE<EAEEEEAAE/    NH:i:6  HI:i:3  AS:i:54 nM:i:0  RE:A:I  li:i:0  BC:Z:GATTAGAT   QT:Z:AAAAAEEE   CR:Z:AATGGAACAGTAGAAT   CY:Z:AAAAAEEEEEEEEEEE   CB:Z:AATGGAACAGTAGAAT-1 UR:Z:ATATCCTATGTG   UY:Z:EEEEEEEEEEEE   UB:Z:ATATCCTATGTG   RG:Z:counts_out_sample:0:1:HKCYMBGXB:1
NB552139:27:HKCYMBGXB:3:12402:19279:16040   256 1   11279   0   56M *   0   0   GCCAGCGCCCCCTGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAG    AAAAAEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEE/E    NH:i:6  HI:i:2  AS:i:55 nM:i:0  RE:A:I  li:i:0  BC:Z:GATTAGAT   QT:Z:AAAAAEEE   CR:Z:ATCCACCTCGCTGTTC   CY:Z:AAAAAEEEEEEEEEEE   CB:Z:ATCCACCTCGCTGTTC-1 UR:Z:CCATATACGTGT   UY:Z:EEEEEEEEEEEE   UB:Z:CCATATACGTGT   RG:Z:counts_out_sample:0:1:HKCYMBGXB:3
NB552139:27:HKCYMBGXB:2:23204:15697:17763   256 1   11310   0   56M *   0   0   CAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGA    AAAAAEEEEEEAE<EEEEEAEEEEEE6EAEE<E<AA<EAE6EEAA/EEEEEEAEEE    NH:i:6  HI:i:3  AS:i:55 nM:i:0  RE:A:I  li:i:0  BC:Z:ACCGTATG   QT:Z:AAAAAEEE   CR:Z:TCAAGCATCGGAGTAG   CY:Z:AAAAAEEEEEEEEEEE   CB:Z:TCAAGCATCGGAGTAG-1 UR:Z:CTTCGGTTTCCT   UY:Z:EEEEEEEEEEEE   UB:Z:CTTCGGTTTCCT   RG:Z:counts_out_sample:0:1:HKCYMBGXB:2

cche commented 1 year ago

Hi @bvaldebenitom,

I got the same error message and solved it by replacing grep \"chr\" (in lines 292 and 293) with a string that is common to my chromosome names, which do not follow the chr# format. Maybe accepting the search pattern as a command-line option would solve this problem for those whose genomes do not have chromosomes named chr1, chr2, etc.
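One way such an option could look is sketched below. This is only an illustration: the `--chr-pattern` flag and both function names are hypothetical and do not exist in SoloTE, and the filter is applied to the first tab-separated field rather than to the whole line as `grep` would.

```python
import argparse
import re


def build_parser():
    """Build a parser with a hypothetical --chr-pattern option (not a real SoloTE flag)."""
    parser = argparse.ArgumentParser(description="Sequence-name pattern sketch")
    parser.add_argument(
        "--chr-pattern",
        default="chr",
        help="regular expression that sequence names must match to be kept",
    )
    return parser


def filter_by_sequence_name(lines, pattern):
    """Keep tab-separated lines whose first field matches `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line.split("\t", 1)[0])]


# Example: keep only Ensembl-style primary chromosomes such as 1, 22, X, Y, MT.
args = build_parser().parse_args(["--chr-pattern", r"^(\d+|X|Y|MT)$"])
kept = filter_by_sequence_name(
    ["1\t100\t200", "chr1\t5\t9", "GL000009.2\t1\t2"], args.chr_pattern
)
```

Keeping `chr` as the default would preserve the current behaviour for genomes that already use UCSC-style names.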

After that, I got the following error message:

Traceback (most recent call last):
  File "SoloTE_pipeline.py", line 321, in <module>
    marketmatrix_line3 = genenumber.stdout.split(" ")[0]+" "+barcodenumber.stdout.split(" ")[0]+" "+allcounts_number.stdout.split(" ")[0]
AttributeError: 'str' object has no attribute 'stdout'

I solved this by deleting the 3 instances of "stdout." on line 321, since genenumber, barcodenumber and allcounts_number are no longer process results at that point, but strings.

This seems to work, but please correct me if it is not the right way to solve these problems.

In the meantime, I will continue with the downstream analysis of the results.

Thanks for this tool!! Cristian

bvaldebenitom commented 1 year ago

@s2hui

the problem is that your BAM file sequence / chromosome names don't match those in the BED file.
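A mismatch like this can be checked before running the pipeline by comparing the sequence names in the BAM header with the first column of the BED file. The sketch below assumes the header text was obtained separately (for example with `samtools view -H`); all function names are hypothetical helpers, not part of SoloTE.

```python
def bam_header_sequence_names(header_text):
    """Collect the SN values from the @SQ lines of a SAM/BAM header."""
    names = set()
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def bed_sequence_names(bed_lines):
    """Collect the first field of every non-empty, non-comment BED line."""
    return {
        line.split()[0]
        for line in bed_lines
        if line.strip() and not line.startswith("#")
    }


def shared_sequence_names(header_text, bed_lines):
    """Return the names present in both the BAM header and the BED file."""
    return bam_header_sequence_names(header_text) & bed_sequence_names(bed_lines)
```

An empty result means no BED interval can ever overlap a read, which is consistent with the empty intermediate files reported above.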

For a quick fix, please run the following command to create a new BED file without the "chr" prefix in the first column:

awk 'BEGIN{FS=OFS="\t"}{gsub("chr","",$1); print $0}' Current_BED_file > NEW_BED_file

Then, remove all the files in the temp directory, and re-run SoloTE.
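For anyone who prefers to do the renaming in Python, a rough equivalent is sketched below. It is not the maintainer's code: the function names are hypothetical, and unlike the `gsub` call it removes only a leading "chr" from the first column. As with the awk command, the fourth column is left untouched.

```python
def strip_chr_prefix(bed_line):
    """Remove a leading 'chr' from the first tab-separated field only."""
    fields = bed_line.rstrip("\n").split("\t")
    if fields[0].startswith("chr"):
        fields[0] = fields[0][3:]
    return "\t".join(fields)


def convert_bed(in_path, out_path):
    """Write a copy of the BED file at `in_path` with the prefix stripped."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_chr_prefix(line) + "\n")
```

Note that a name such as chrM would become M, which would still not match a BAM file that names the mitochondrial sequence MT, so the names are worth re-checking after conversion.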

bvaldebenitom commented 1 year ago

Hi @cche!

Thank you for sharing your result. In the next release, we will fix the "chr" issue. Glad that you solved it.

Can you share your operating system and Python version? We have noticed that the "stdout" reference works on some operating systems and not on others (Linux vs. macOS, for example).

gammertens commented 1 year ago

To add to this thread, I encountered the same issue as @cche and could solve it the same way. Running a previous release (1.07 or 1.06) also works. I'm running on macOS Ventura and Python 3.9.5!

s2hui commented 1 year ago

@bvaldebenitom

Thank you. I had noticed the BAM file sequence / chromosome name mismatch and had updated the BED file accordingly, but did not delete the temp files before re-running.

After deleting the temp files, it appears to be working now!

Thanks again for your help!

cche commented 1 year ago

Hi @bvaldebenitom,

I use Rocky Linux and Ubuntu, with Python 3.10.11 installed with conda.

It is very strange that .stdout works at all on other operating systems: the variables that originally stored the CompletedProcess objects are reassigned to the output of re.sub(), which is a string and has no .stdout attribute at all.
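The difference can be reproduced in isolation. The sketch below is not the SoloTE code: it uses a stand-in command and made-up variable names, and simply keeps the `CompletedProcess` and the cleaned string under separate names so that `.stdout` is only ever read from the former.

```python
import re
import subprocess
import sys


def run_count_command():
    """Run a stand-in command that prints a left-padded count, as a line counter might."""
    return subprocess.run(
        [sys.executable, "-c", "print('   42 features.tsv')"],
        capture_output=True,
        text=True,
        check=True,
    )


completed = run_count_command()                    # CompletedProcess: has .stdout
cleaned = re.sub(r"^\s+", "", completed.stdout)    # plain str: has no .stdout
first_token = cleaned.split(" ")[0]                # the leading number as text
```

Reusing one variable name for both objects is what makes a later `.stdout` access fail with the `AttributeError` shown earlier in the thread.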

I hope you solve these OS differences so that your code is stable everywhere.

Thanks again for this tool. The downstream analyses look great!