CMU-SAFARI / RawHash

RawHash is the first mechanism that can accurately and efficiently map raw nanopore signals to large reference genomes (e.g., a human reference genome) in real-time without using powerful computational resources (e.g., GPUs). Described by Firtina et al. (published at https://academic.oup.com/bioinformatics/article/39/Supplement_1/i297/7210440)
https://academic.oup.com/bioinformatics/article/39/Supplement_1/i297/7210440
GNU General Public License v3.0
39 stars 5 forks source link

A few questions on rawhash #1

Closed hasindu2008 closed 12 months ago

hasindu2008 commented 1 year ago

Hi,

rawhash looks fantastic and I wanted to give it a try. But I got an error in the compilation process.
I have attached the log from the make command for your reference. Could you have a look and let me know some tips? log.txt

Also, I have a few questions.

  1. Does rawhash support the latest R10.4.1?
  2. Does rawhash support RNA?
canfirtina commented 1 year ago

Hi @hasindu2008,

Thanks for your interest in using RawHash!

Interesting that you got this error. We were not getting this linking error. I just changed the Makefile to link pthread and standard C++ libraries. Could you pull the newest changes and try again?

For your questions:

  1. RawHash should support the latest chemistry given the correct nanopore models. However, the k-mer model file parsing currently assumes a certain file format as shown here: https://github.com/nanoporetech/kmer_models/tree/d14beec6dc8eb7bcaeff1ee87dc33d99b54f7513/r9.4_180mv_450bps_6mer. If your k-mer model has a different structure, it should be converted to this format before usage.

We are planning to provide an easy-to-usage support for the recent chemistries soon though. This includes the easy parsing of the recent k-mer model released my ONT (with fewer columns) and optimizing the parameters for the recent chemistries. Quantization and other parameters (-k, -e, -q, -l) may need to be slightly changed to optimize accuracy and performance.

  1. We have not performed evaluations on RNA dataset. Similar to my previous response, certain parameters may need to be optimized also for RNA.
hasindu2008 commented 1 year ago

@canfirtina

Did you forget to commit? I cannot see a recent change in the makefile.

canfirtina commented 1 year ago

@hasindu2008

You are absolutely right! Probably due to the networking issue, the push did not go through on our end.

The following commit is now pushed: 9099df0

hasindu2008 commented 1 year ago

Unfortunately I am getting the same error.

/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(table.cc.o): In function `arrow::SimpleTable::Validate() const':
table.cc:(.text._ZNK5arrow11SimpleTable8ValidateEv[_ZNK5arrow11SimpleTable8ValidateEv]+0x94): undefined reference to `std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()'
/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(type.cc.o): In function `arrow::Field::ToString(bool) const [clone .localalias]':
type.cc:(.text+0x52e9): undefined reference to `std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()'
/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(type.cc.o): In function `arrow::FixedSizeBinaryType::ToString() const [clone .localalias]':
type.cc:(.text+0x543f): undefined reference to `std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()'
/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(type.cc.o):type.cc:(.text+0x54cf): more undefined references to `std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >::basic_stringstream()' follow
/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(thread_pool.cc.o): In function `std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::~_State_impl()':
thread_pool.cc:(.text+0x3db): undefined reference to `std::thread::_State::~_State()'
/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(thread_pool.cc.o): In function `arrow::internal::ThreadPool::LaunchWorkersUnlocked(int) [clone .localalias]':
thread_pool.cc:(.text+0x1e34): undefined reference to `std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)())'
/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(thread_pool.cc.o): In function `std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::~_State_impl()':
thread_pool.cc:(.text+0x47f): undefined reference to `std::thread::_State::~_State()'
/mnt/d/hasindu2008.git/RawHash/src/../extern/pod5-0.1.8-Linux/lib64/libarrow.a(thread_pool.cc.o):(.data.rel.ro+0x10): undefined reference to `typeinfo for std::thread::_State'
collect2: error: ld returned 1 exit status
Makefile:149: recipe for target 'rawhash' failed
make[1]: *** [rawhash] Error 1
make[1]: Leaving directory '/mnt/d/hasindu2008.git/RawHash/src'
Makefile:4: recipe for target 'rawhash' failed
make: *** [rawhash] Error 2

Do you use some compiled binaries for some libraries? This at a glance looks like an incompatibility of some precompiled binaries with the particular compiler version I am using.

canfirtina commented 1 year ago

Hi @hasindu2008,

You are correct. Due to incorrect parsing of your system information, RawHash was probably downloading the incompatible POD5 libraries. I have been working on making the Makefile more flexible to mitigate these issues.

I have made some changes to Makefile. You can now either 1) disable compiling the POD5 libraries if you do not need them or 2) provide the path to your own POD5 include and lib directories. Please check https://github.com/CMU-SAFARI/RawHash#compiling-with-hdf5-slow5-and-pod5 for more detailed information.

For sanity check, could you try to compile by disabling POD5 as follows:

make NOPOD5=1

Additionally, I have changed the implementation for parsing the k-mer model files. RawHash should now correctly parse both R9 and R10 k-mer model files. For R10, you may need to specify -k 9 if the model includes the values for 9-mers. Please check the help message for further details.

Note: We have not tested RawHash for R10 to provide the best possible parameter settings. The default preset parameters for R10 will be available soon.

canfirtina commented 1 year ago

Hi @hasindu2008,

Are you still experiencing the same problem? (if this is the case, please for now compile it with the NOPOD5=1 option)

On a separate note, we have also added the SLOW5 support to RawHash although I could not extensively test it in a wide range of scenarios. I would appreciate your feedback if you notice any issues there as well.

hasindu2008 commented 1 year ago

Hi I still couldn't find some time to get back to this. But i will test this soon and let you know. Thanks.

hasindu2008 commented 1 year ago

@canfirtina

Fantastic! Thanks for adding the S/BLOW5 support. I was able to quickly compile using make NOHDF5=1 NOPOD5=1. Also, I was able to execute rawhash on a little dataset and it did the job quite fast!

Have you done a large benchmark - interested to see how much ri_read_sig_slow5 takes compared to the actual mapping? Why I am asking this is, If the ri_read_sig_slow5 becomes a bottleneck (which currently performs S/BLOW5 reading + decompression+parsing and the pA conversion using a single thread), you can parallelise that part using multiple threads as in this advanced example here. The basic idea is that you use a slow5_get_next_bytes() inside ri_read_sig_slow5 to fetch a batch of records, then you do the slow5_decode() and the pA conversion part using multiple threads inside your multi-threaded processing code. These advanced slow5 functions are documented here https://hasindu2008.github.io/slow5lib/slow5_api/slow5_low_level_api.html.

Would you mind if I added rawHash to the list here?

Suggestions: Consider renaming the char* strsep(char** stringp, const char* delim) in rutils.c to something like char* strsep(char** stringp, const char* delim). I had to do this as the compilation was otherwise erroring out due to a clash with the strsep in GNU C/C++ compiler.

canfirtina commented 1 year ago

@hasindu2008

Excellent! Glad to hear that things are working fine now.

I haven't done a large benchmark yet to identify the potential bottlenecks. I will keep in mind your suggestion as currently there is definitely some room for improving the thread utilization of RawHash. To implement your suggestion, we could assign more threads to the IO part rather than allocating almost all of them for mapping. Thanks a lot for the pointers and the tip regarding strsep, much appreciated!

Of course, we would be happy to see RawHash to be listed there.

hasindu2008 commented 12 months ago

Done. Let me know if you had any difficulties in using slow5lib and if you have any suggestions. Nowadays, it is hard to find anyone writing things in C. Feel free to close this issue.