more incorrect number of column errors

EI-CoreBioinformatics / portcullis

Splice junction analysis and filtering from BAM files

https://ei-corebioinformatics.github.io/portcullis/

GNU General Public License v3.0

more incorrect number of column errors #51

Open adamfreedman opened 4 years ago

adamfreedman commented 4 years ago

I just installed the latest portcullis with bioconda, so all versioning issues should be managed within that install, correct? Even so, I am getting:

src/junction.cc(1242): Throw in function static std::shared_ptr portcullis::Junction::parse(const string&)

Dynamic exception type: boost::wrapexcept

std::exception::what: std::exception

[portcullis::JunctionError*] = Could not parse line due to incorrect number of columns. This is probably a version mismatch. Check file and portcullis versions. Expected 75 columns. Found 74. Line: 37950851391Sca51543818022904231742712279623201-?-AAN0006560604061.002401.79247999999999990.0102272709900011002043000000166666666666666664444

lucventurini commented 4 years ago

Dear @adamfreedman, I will have a look and let you know.

Apologies for the late reply but I was on leave last week.

Kind regards

swarbred commented 4 years ago

Dear @adamfreedman

While we fix this, try cleaning the portcullis_all.junctions.tab output file of the junc stage:

awk '!(($11=="?" && $14=="NA") || ($11=="?" && $13=="NA"))' portcullis_all.junctions.tab > portcullis_all.junctions.cleaned.tab

Then run the filt stage, passing in the cleaned file.

On a similar example that was failing for me, the cleanup removed 44 lines and the filt stage then completed.
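For anyone who would rather not use awk, the same cleanup can be sketched in Python. This is only an illustration of the filter above, not part of portcullis; the function names `keep_line` and `clean` are made up for this example, and it assumes whitespace-separated fields, as awk does by default.

```python
def keep_line(line: str) -> bool:
    """Mirror the awk filter: drop a line when field 11 is "?" and
    field 13 or field 14 is "NA"; keep everything else."""
    fields = line.split()  # like awk's default: split on runs of whitespace

    def field(n: int) -> str:
        # awk fields are 1-based; a missing field compares as an empty string
        return fields[n - 1] if len(fields) >= n else ""

    return not (field(11) == "?" and (field(13) == "NA" or field(14) == "NA"))


def clean(src, dst) -> int:
    """Copy the kept lines from src to dst and return how many were removed."""
    removed = 0
    for line in src:
        if keep_line(line):
            dst.write(line)
        else:
            removed += 1
    return removed
```

A possible use is to open portcullis_all.junctions.tab for reading and portcullis_all.junctions.cleaned.tab for writing, then call `clean(src, dst)` and check the returned count against what you expect to lose.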

lucventurini commented 4 years ago

Hi @swarbred, would you be able to send over the location of the offending file (or even attach it here or send it by email)? I might try to see what's happening.

Best

swarbred commented 4 years ago

Hi @lucventurini

I will copy the portcullis output to a location you have access to and send you the details.

lucventurini commented 4 years ago

Hi @adamfreedman, @swarbred, I found the problem. The script rule_filter.py (called during the filtering stage) uses pandas to do the necessary filtering before the self-training procedure. When it writes out the final lines, though, it writes "NA" values as empty fields rather than as an explicit value. This then breaks the parser in the portcullis C++ library, which loads the junctions for the self-training procedure. I will endeavour to find a fix ASAP.
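To check whether a given file is affected, one could scan it for empty fields. The sketch below is a hypothetical helper, not something shipped with portcullis; it assumes the file is tab-separated, and `find_empty_fields` is a name invented here.

```python
def find_empty_fields(lines, sep="\t"):
    """Yield (line_number, [1-based column numbers]) for every line
    that contains at least one empty field."""
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        empty = [col for col, value in enumerate(fields, start=1) if value == ""]
        if empty:
            yield lineno, empty
```

Passing an open file handle for the filtered junctions file to `find_empty_fields` and printing the results should show which lines and columns lost their value.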

Kind regards

lucventurini commented 4 years ago

Hi all, I think that 9f86ebe could fix this issue. I basically force rule_filter.py to output all the NAs as "NA" strings in the filtered files, and this should ensure compatibility with the C++ parser. I have not tested it extensively.
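The write-side behaviour can be reproduced with a small pandas example. This is a sketch of the general idea only, not the actual change in 9f86ebe; the column names `id`, `ss1` and `ss2` are invented for illustration.

```python
import io

import pandas as pd

# Toy table with one missing value (hypothetical column names).
df = pd.DataFrame({"id": [0, 1], "ss1": ["GT", None], "ss2": ["AG", "AG"]})

# Default: a missing value is written as an empty field.
default_out = io.StringIO()
df.to_csv(default_out, sep="\t", index=False)

# With na_rep, a missing value is written as the literal string "NA".
explicit_out = io.StringIO()
df.to_csv(explicit_out, sep="\t", index=False, na_rep="NA")

default_row = default_out.getvalue().splitlines()[2]
explicit_row = explicit_out.getvalue().splitlines()[2]
print(repr(default_row))   # row with an empty middle field
print(repr(explicit_row))  # row with "NA" in the middle field
```

With `na_rep="NA"` every row keeps a visible value in each column, which is what a strict column-counting parser needs.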

@swarbred if you could install and test on the problematic dataset, we could confirm the fix.

Best

lucventurini commented 4 years ago

Also: the bug is triggered because occasionally a splice junction has a donor or acceptor dinucleotide of "NA". pandas interprets this as a NaN value, which is not the intended behaviour! I have hopefully fixed this in ebe42fb by removing "NA" from the list of values that pandas.read_csv treats as NaN.
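The read-side behaviour is easy to demonstrate. The example below is a minimal sketch, not the code from ebe42fb: the column names are invented, and treating only the empty string as missing is an assumption made here for illustration.

```python
import io

import pandas as pd

# Toy junction table where the dinucleotide in the second row really is "NA".
text = "id\tss1\tss2\n0\tGT\tAG\n1\tNA\tAG\n"

# Default behaviour: the string "NA" is parsed as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(text), sep="\t")

# Disable the default NaN markers and list the accepted ones explicitly
# (assumption: only an empty field counts as missing).
literal_df = pd.read_csv(
    io.StringIO(text), sep="\t", keep_default_na=False, na_values=[""]
)

print(default_df["ss1"].isna().tolist())
print(literal_df["ss1"].tolist())
```

In the second read the "NA" dinucleotide survives as an ordinary string, so it is written back out unchanged.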