PalMuc / TransPi

TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

Error in process busco4_dist #32

Open ghost opened 2 years ago

ghost commented 2 years ago

Hi, I apologize for contacting you so often.

While running SOS_busco.py in the busco4_dist process, I got the following error:

Command error:
  Traceback (most recent call last):
    File "/mnt/data/software/TransPi/bin/SOS_busco.py", line 38, in <module>
      busco_df = pd.read_csv(input_busco_file, sep=',',header=0,names=['Busco_id','Status','Sequence','Score','Length'])
    File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
      return _read(filepath_or_buffer, kwds)
    File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 458, in _read
      data = parser.read(nrows)
    File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1186, in read
      ret = self._engine.read(nrows)
    File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2145, in read
      data = self._reader.read(nrows)
    File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
    File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
    File "pandas/_libs/parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
    File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
    File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
  pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 51, saw 8

I think the problem is in the input file for SOS_busco.py (in my case, Read_R_all_busco4.tsv). Most lines of my Read_R_all_busco4.tsv have 6 commas (7 columns), like this:

0at38820,Duplicated,SOAP.k25.scaffold27258,8202.3,4167,https://www.orthodb.org/v10?query=0at38820,sacsin

However, some lines of my file have 7 or 8 commas (8 or 9 columns), like this:

121at38820,Complete,SOAP.k25.scaffold11722,3027.5,1446,https://www.orthodb.org/v10?query=121at38820,Zinc finger, RING-type

I think this difference in the number of commas (columns) is the cause of the pandas error.
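As a rough illustration of the mismatch, here is a minimal Python sketch that counts the comma-separated fields in the two example rows quoted above. `count_fields` is a hypothetical helper written for this example; it is not part of TransPi or SOS_busco.py.

```python
# The two example rows from this issue, as they look after tabs
# have been converted to commas.
rows = [
    "0at38820,Duplicated,SOAP.k25.scaffold27258,8202.3,4167,"
    "https://www.orthodb.org/v10?query=0at38820,sacsin",
    "121at38820,Complete,SOAP.k25.scaffold11722,3027.5,1446,"
    "https://www.orthodb.org/v10?query=121at38820,Zinc finger, RING-type",
]


def count_fields(line, sep=","):
    """Return how many fields a plain split on `sep` produces (hypothetical helper)."""
    return len(line.split(sep))


field_counts = [count_fields(row) for row in rows]
print(field_counts)  # [7, 8]
```

The second row yields one more field than the first, which matches the "Expected 7 fields ... saw 8" message in the traceback.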

SOS_busco.py doesn't seem to use columns 6 onwards of the input file. If so, we can drop those columns before running SOS_busco.py, at this point in TransPi.nf: https://github.com/PalMuc/TransPi/blob/899d16028e2d84e746c8c0dda1c6ba9ebcca050e/TransPi.nf#L1591-L1592

Here is an example of the change I suggest:

cat $transpi_tsv | grep -v "#" | tr "\\t" "," >>$all_busco
awk -F',' -v OFS=',' '{print $1,$2,$3,$4,$5}' $all_busco > some.csv
SOS_busco.py -input_file_busco some.csv -input_file_fasta $assembly -min ${params.minPerc} -kmers ${params.k}
rm -rf some.csv
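
The same trimming idea can be sketched in Python, assuming (as suggested above) that only the first five columns are needed downstream. `keep_first_columns` is a hypothetical helper for illustration, not a function from TransPi.

```python
def keep_first_columns(line, n=5, sep=","):
    """Keep only the first `n` separator-delimited fields of one line (hypothetical helper)."""
    return sep.join(line.rstrip("\n").split(sep)[:n])


# The two example rows from this issue (7 and 8 comma-separated fields).
rows = [
    "0at38820,Duplicated,SOAP.k25.scaffold27258,8202.3,4167,"
    "https://www.orthodb.org/v10?query=0at38820,sacsin",
    "121at38820,Complete,SOAP.k25.scaffold11722,3027.5,1446,"
    "https://www.orthodb.org/v10?query=121at38820,Zinc finger, RING-type",
]

trimmed = [keep_first_columns(row) for row in rows]
for line in trimmed:
    print(line)
```

After trimming, every row has exactly five fields, so the extra commas in the description column no longer reach the parser.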

I hope this helps you. Thank you.

rivera10 commented 2 years ago

Hello @HarukiNakamura,

No worries. Thanks for finding issues and suggesting improvements to TransPi. We appreciate it.

You are right: the last column causes problems because the name can contain a comma, which makes SOS_busco.py fail. I think the easiest solution is what you suggested. I will do a test and modify the code. Thanks!
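
To make the cause concrete, here is a small Python sketch. It assumes the original BUSCO row is tab-separated with the seven fields shown in the report above; the tab-separated row is reconstructed for this example, not copied from a real full_table file.

```python
# Assumed original row: seven tab-separated fields, where the last
# field (the name) itself contains a comma.
tsv_row = "\t".join([
    "121at38820",
    "Complete",
    "SOAP.k25.scaffold11722",
    "3027.5",
    "1446",
    "https://www.orthodb.org/v10?query=121at38820",
    "Zinc finger, RING-type",
])

# Splitting on tabs keeps the name as a single field.
fields_tab = tsv_row.split("\t")

# Converting tabs to commas first (as tr "\t" "," does) splits the name in two.
fields_comma = tsv_row.replace("\t", ",").split(",")

print(len(fields_tab), len(fields_comma))  # 7 8
```

The field count only changes once the tabs are replaced with commas, which is why dropping the trailing columns before the comma-based parse avoids the error.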

Best, Ramón

rivera10 commented 2 years ago

Pinging @n-conci

AlexGaithuma commented 1 year ago

This works (at lines 1517 and 1591 of TransPi.nf, respectively):

cat full_table_*.tsv | grep -v "#" | tr "\t" "," | cut -d ',' -f1-5 >.busco_names.txt
cat $transpi_tsv | grep -v "#" | tr "\t" "," | cut -d ',' -f1-5 >>$all_busco