Alevin: Problem with PyPI vpolo ["Reading Alevin’s bfh (big freaking hash) file" section of Alevin tutorial]

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)? Alevin single-cell mode.

Describe the bug
Hi, I ran into a problem while following this tutorial: https://combine-lab.github.io/alevin-tutorial/2018/output-format/ . The section "Reading Alevin’s bfh (big freaking hash) file" has just two lines to run. The second one, the call to parser.read_bfh(), fails.

It raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 110446, saw 20
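For context, this kind of error can be reproduced with pandas alone. The sketch below assumes (it is not confirmed from the vpolo source) that the file is read with something like pandas.read_csv, which settles on a column count from the first line when no header or column names are given. The miniature input and the helper try_parse are hypothetical and only illustrate the failure mode:

```python
import io

import pandas as pd

# Hypothetical miniature input: three single-field lines, then a wider line.
ragged_text = "3\nENST0001\nENST0002\n1\t0\t2\t5\n"


def try_parse(text: str, sep: str = "\t"):
    """Return (DataFrame, None) on success or (None, message) on a ParserError."""
    try:
        frame = pd.read_csv(io.StringIO(text), sep=sep, header=None)
        return frame, None
    except pd.errors.ParserError as err:
        return None, str(err)


frame, error = try_parse(ragged_text)
print(error)  # a tokenizing error that names the first line wider than line 1
```

If this is what happens inside read_bfh, the message alone does not say whether the file or the parser is wrong. It only says that a later line has more fields than the first one.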

I tried diagnosing the problem by inspecting the input bfh.txt file. Line 110446 is not the only one affected: many other lines also have more than one field. So the real question is whether bfh.txt should have only one field per line. If so, the bfh.txt file itself is malformed. If not, the parser is at fault, since it should handle lines with more than one field.
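The manual inspection described above can be sketched with the standard library only. This assumes the fields are tab-separated (pass a different sep if not). The function names and the sample lines are hypothetical, not part of vpolo or alevin:

```python
from collections import Counter
from typing import Iterable, Optional


def field_count_histogram(lines: Iterable[str], sep: str = "\t") -> Counter:
    """Count how many non-empty lines have each number of fields."""
    counts: Counter = Counter()
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped:
            continue  # skip blank lines
        counts[len(stripped.split(sep))] += 1
    return counts


def first_line_exceeding(
    lines: Iterable[str], max_fields: int = 1, sep: str = "\t"
) -> Optional[int]:
    """Return the 1-based number of the first line with more than max_fields fields."""
    for number, line in enumerate(lines, start=1):
        if len(line.rstrip("\n").split(sep)) > max_fields:
            return number
    return None


# Hypothetical sample standing in for the real bfh.txt contents.
sample = ["3\n", "ENST0001\n", "ENST0002\n", "1\t0\t2\t5\n", "2\t0\t1\t1\t7\n"]
histogram = field_count_histogram(sample)
first_wide = first_line_exceeding(sample)
print(dict(histogram), first_wide)
```

To run it on real output, pass an open file handle instead of the sample list, for example open("alevin_output/alevin/bfh.txt") (this path is an assumption about where alevin writes the file). A histogram dominated by one-field lines, followed by a long run of wider lines, would suggest that the multi-field lines are intended and not corruption.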

To Reproduce

Which version of salmon was used?
1.4.0
How was salmon installed (compiled, downloaded executable, through bioconda)?
conda create -n salmon -c conda-forge -c bioconda salmon
conda activate salmon
Which reference (e.g. transcriptome) was used?
The index was generated from the "Protein-coding transcript sequences" file available here: https://www.gencodegenes.org/human/
The txp2gene.tsv file was generated from the "Comprehensive Gene Annotation, Region: PRI" file available here: https://www.gencodegenes.org/human/release_37lift37.html
Which read files were used?
5k_pbmc FASTQ file from 10x: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.2/5k_pbmc_v3
Which program options were used?
salmon alevin -l ISR -1 5k_pbmc_v3_S1_L001_R1_001.fastq.gz 5k_pbmc_v3_S1_L002_R1_001.fastq.gz 5k_pbmc_v3_S1_L002_R1_001.fastq.gz -2 5k_pbmc_v3_S1_L001_R2_001.fastq.gz 5k_pbmc_v3_S1_L002_R2_001.fastq.gz 5k_pbmc_v3_S1_L003_R2_001.fastq.gz --chromiumV3 -i index -p 10 -o alevin_output --tgMap txp2gene.tsv --dumpBfh --noDedup --dumpBarcodeEq

Expected behavior
The bfh.txt file should be parsed. In other words, the call parser.read_bfh("<PATH to alevin output folder>", "<PATH to t2g file>") should run without error, as described in the tutorial: https://combine-lab.github.io/alevin-tutorial/2018/output-format/

Screenshots
Screenshot of error:

Desktop

OS: Ubuntu Linux
Version: Ubuntu 20.04.2 LTS (64 bit)

Additional context

COMBINE-lab / salmon, issue #650