Open adamfreedman opened 4 years ago
Dear @adamfreedman , I will have a look and let you know.
Apologies for the late reply but I was on leave last week.
Kind regards
Dear @adamfreedman
While we fix this, try cleaning the portcullis_all.junctions.tab output file of the junc stage:
cat portcullis_all.junctions.tab | awk '!(($11=="?" && $14=="NA") || ($11=="?" && $13=="NA"))' > portcullis_all.junctions.cleaned.tab
Then run the filt stage, passing in the cleaned file.
On a similar example that was failing for me, the cleanup removed 44 lines and the filt stage then completed.
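For anyone who prefers to do the same cleanup without awk, here is a minimal Python sketch of the filter above. The function name clean_junction_lines and the sample rows are made up for illustration; only the column positions (11, 13, 14) and the "?" / "NA" values come from the awk command. It splits on whitespace, as awk does by default.

```python
def clean_junction_lines(lines):
    """Keep only the lines that the awk filter above would keep."""
    kept = []
    for line in lines:
        fields = line.split()  # whitespace split, like awk's default

        def col(n):
            # awk columns are 1-based; a missing column compares as empty
            return fields[n - 1] if len(fields) >= n else ""

        # Drop rows where column 11 is "?" and column 13 or 14 is "NA"
        if col(11) == "?" and (col(13) == "NA" or col(14) == "NA"):
            continue
        kept.append(line)
    return kept


# Made-up 14-column rows, only to exercise the filter
base = [str(i) for i in range(1, 15)]
row_plain = "\t".join(base)
row_na14 = "\t".join(base[:10] + ["?", "12", "GT", "NA"])
row_na13 = "\t".join(base[:10] + ["?", "12", "NA", "AG"])
row_unknown_strand = "\t".join(base[:10] + ["?", "12", "GT", "AG"])

cleaned = clean_junction_lines([row_plain, row_na14, row_na13, row_unknown_strand])
print(len(cleaned))
```

To apply it to a real file, read the lines of portcullis_all.junctions.tab, pass them through the function, and write the result to the cleaned file before running the filt stage.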
Hi @swarbred, would you be able to send over the location of the offending file (or attach it here, or send it by email)? I might try to see what's happening.
Best
Hi @lucventurini
I will copy the portcullis output to a location you have access to and send you the details.
Hi @adamfreedman , @swarbred ,
I found the problem. The script rule_filter.py (called during the filtering stage) does the necessary filtering before the self-training procedure, using pandas. When it writes out the final lines, though, it writes the "NA" values as an empty field rather than as a specific value.
This subsequently breaks the parsing in the portcullis C++ library that should load the junctions into the self-training procedure.
I will endeavour to find a fix ASAP.
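As an illustration of the behaviour described above, here is a small sketch using a made-up table (the column names are hypothetical and this is not the code in rule_filter.py). By default, pandas DataFrame.to_csv writes missing values as empty fields; the na_rep argument makes it write an explicit string instead.

```python
import pandas as pd

# Made-up two-row table; column names are illustrative only
df = pd.DataFrame({"index": [0, 1], "ss1": ["GT", None], "ss2": ["AG", "AG"]})

# Default: the missing value becomes an empty field between two tabs
default_out = df.to_csv(sep="\t", index=False)

# With na_rep, the missing value is written as the literal string "NA"
explicit_out = df.to_csv(sep="\t", index=False, na_rep="NA")

print(repr(default_out.splitlines()[2]))
print(repr(explicit_out.splitlines()[2]))
```

A parser that expects every column to hold a non-empty token can then read the second form, while the first form leaves a gap in the row.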
Kind regards
Hi all, I think that 9f86ebe could fix this issue. I basically force rule_filter.py to output all the NAs as "NA" strings in the filtered files, and this should ensure compatibility with the C++ parser. I have not tested it extensively.
@swarbred if you could install and test on the problematic dataset, we could confirm the fix.
Best
Also: the bug is triggered by the fact that occasionally there will be a splicing junction with a donor or an acceptor with dinucleotide "NA".
This gets interpreted by pandas as a NaN value, which is not the intended behaviour! I hopefully fixed this in ebe42fb by removing "NA" from the valid list of NaN values for pandas.read_csv.
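To show the effect in isolation, here is a minimal sketch with a made-up table (hypothetical column names; this is one way to stop pandas.read_csv treating "NA" as missing, not necessarily what the commit does). With default settings the string "NA" is read as NaN; passing keep_default_na=False together with an explicit na_values list keeps it as text.

```python
import io

import pandas as pd

# Made-up table in which "NA" is a real dinucleotide, not a missing value
tsv = "index\tss1\tss2\n0\tGT\tAG\n1\tNA\tAG\n"

# Default behaviour: "NA" is in pandas' built-in list of missing-value markers
default_df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Only treat empty fields as missing, so "NA" survives as a string
strict_df = pd.read_csv(
    io.StringIO(tsv), sep="\t", keep_default_na=False, na_values=[""]
)

print(default_df["ss1"].isna().tolist())
print(strict_df["ss1"].tolist())
```

The choice of na_values=[""] here is an assumption for the example; a real file may need other markers in that list.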
I just installed the latest portcullis with bioconda, so all versioning issues should be managed within that install, correct? Even so, I am getting:
src/junction.cc(1242): Throw in function static std::shared_ptr portcullis::Junction::parse(const string&)
Dynamic exception type: boost::wrapexcept
std::exception::what: std::exception
[portcullis::JunctionError*] = Could not parse line due to incorrect number of columns. This is probably a version mismatch. Check file and portcullis versions. Expected 75 columns. Found 74. Line:
37950851391Sca51543818022904231742712279623201-?-AAN0006560604061.002401.79247999999999990.0102272709900011002043000000166666666666666664444
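One way to look for the cause of a column-count error like this is to scan the junctions file for empty fields, since an empty NA field was the cause identified earlier in this thread. The sketch below is a generic diagnostic, not part of portcullis: the function name lines_with_empty_fields is hypothetical, it assumes the file is tab-separated, and it assumes that the missing column comes from an empty field.

```python
def lines_with_empty_fields(lines, sep="\t"):
    """Return (line_number, [empty column numbers]) for each affected line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        # Column numbers are 1-based to match awk-style references
        empty = [i for i, field in enumerate(fields, start=1) if field == ""]
        if empty:
            hits.append((number, empty))
    return hits


# Made-up rows: the second one has an empty second column
sample = ["a\tb\tc\n", "1\t\t3\n", "4\t5\t6\n"]
report = lines_with_empty_fields(sample)
print(report)
```

Running it over the lines of the filtered junctions file would list which rows and columns to inspect.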