ParBLiSS / bruno

Distributed Memory De Bruijn Graph Library
Apache License 2.0
Unable to build the software #1

Open ShuangQiuac opened 5 years ago

ShuangQiuac commented 5 years ago

Hi, I downloaded the software and used the following commands to build it:

mkdir build
cd build
cmake ..
make

But it returns the following error:

CMake Error: File /ext/kmerind/src/config/config.hpp.in does not exist.

Can you suggest how to build and use it? Thanks!

tcpan commented 5 years ago

Hi, Shuang, I need to update the instructions. There is a step missing.

You need to run the following from the source directory:

git submodule update --init --recursive --progress

That will download the dependencies.

Then run cmake and make as before.

Please let me know if you run into further issues. Thanks!

Tony Pan
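
Put together, the build sequence described above might look like the following sketch. The function name build_bruno and its argument are hypothetical (they are not part of bruno); the sketch assumes the repository has already been cloned and that git, cmake, and make are on the PATH.

```shell
# Sketch only: build_bruno is a hypothetical wrapper, not part of bruno.
# Usage: build_bruno /path/to/bruno   (an existing clone of the repository)
build_bruno() (
    # The body runs in a subshell, so the caller's working directory is untouched.
    cd "$1" || exit 1
    # Fetch the dependencies (the step missing from the original instructions).
    git submodule update --init --recursive --progress || exit 1
    # Out-of-source build, as in the original report.
    mkdir -p build && cd build || exit 1
    cmake .. && make
)
```

Running the steps inside a subshell is a design choice: a failure at any step stops the sequence without leaving the calling shell in a different directory.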

ShuangQiuac commented 5 years ago

Dear Tony Pan,

Thanks for the reply! I can build the program by executing the command "git submodule update --init --recursive" before cmake and make.

It generates "clear_cache" and "sys_probe" in the bin directory. Can you please provide further examples and instructions on how to run the program and how to specify its parameters? Thanks!

Best regards.

Shuang

tcpan commented 5 years ago

Hi, Shuang, it looks like I forgot to turn on "Build Example Applications" by default.

Can you try using ccmake instead of cmake:

ccmake src_dir

This presents a text-based (curses) interface for configuring the project. The first item should be BUILD_EXAMPLE_APPLICATIONS. Change that to "ON", then press c to configure and g to generate and exit.

This is going to create a fairly large number of targets (what I needed during my evaluation). You can list the build targets with:

cmake --build . --target help

At this point, if you run "make" (or "make -j 4"), everything will be built, which might take a while. Alternatively, run "make {target}", where {target} is one of the targets listed. There are so many targets because of C++ templating and a desire to reduce individual binary size and avoid excessive branching in the code.

Now some explanation of the target naming conventions. Here are a couple of example targets:

compact_debruijn_graph_fastq_A4_K21_freq_clean_recompact_incr
compact_debruijn_graph_fastq_A4_K31_freq_minimizer

  fastq: operates on FASTQ files. We can easily support FASTA files as well; I'll explain how in a little bit.
  A4: standard 2-bit DNA encoding. The alternative is A16, which supports 4-bit DNA encoding (IUPAC).
  K21: k-mer length. The cmake script is currently configured with 21, 31, 51, 55, and 63. We can easily support others as well.
  freq: my "code name" for an optimized graph construction algorithm. You should choose binaries with this label by default.
  clean and clean_recompact: bubbles and deadends are removed and chains are recompacted. I used some simple criteria for identifying bubbles and deadends, and they may not be what you want. The code is set up so that an application developer can define their own criteria, but this requires some C++ coding.
  minimizer: an attempt at using minimizers for data distribution across multiple nodes. It is not performing well yet, so you should avoid these.
  incr: for when the input files push the memory limit. This is data dependent (number of unique k-mers), but you may want to try these incremental versions if you have multiple files in your dataset and the FASTQ files are more than 1/16 of the total memory (a guess).

To support FASTA files and other k values, we just need to change the CMakeLists.txt file to generate the appropriate targets. I can show you how to do that.

To summarize quickly, use the versions with the "A4" and "freq" labels. If you think you'll run out of memory, try the "incr" version. If you need FASTA file support or other k values, let me know. If you need to remove bubbles and dead ends, we should talk.

Making the configure process easier has been on my to-do list for a while. I'll try to find some time to work on this.

Thanks
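
The naming convention above can be made concrete with a small helper that splits a target name into its labels. This is only a sketch: describe_target is a hypothetical name, and the field order (format, alphabet, k, then the remaining variant labels) is inferred solely from the two example targets quoted in this thread, not from the project's CMake scripts.

```shell
# Sketch only: describe_target is a hypothetical helper, not part of bruno.
# Assumes names of the form compact_debruijn_graph_<format>_<alphabet>_K<k>_<variant...>
describe_target() {
    rest=${1#compact_debruijn_graph_}       # drop the common prefix
    fmt=${rest%%_*};      rest=${rest#*_}   # e.g. fastq
    alphabet=${rest%%_*}; rest=${rest#*_}   # e.g. A4
    k=${rest%%_*};        rest=${rest#*_}   # e.g. K21; the remainder is the variant
    echo "format=$fmt alphabet=$alphabet k=${k#K} variant=$rest"
}

describe_target compact_debruijn_graph_fastq_A4_K21_freq_clean_recompact_incr
# prints: format=fastq alphabet=A4 k=21 variant=freq_clean_recompact_incr
```

The variant part is kept as one underscore-joined string because labels such as clean_recompact themselves contain an underscore.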

ShuangQiuac commented 5 years ago

Dear Tony,

Thanks for your instructions! Unfortunately, I cannot use ccmake under CentOS on our lab servers. Could you please provide other instructions on how to use it with only cmake and make? For example, if I want to run bruno on the human chromosome 14 dataset, how can I build the program, and what parameters, e.g. k-mer length and minimizer length, should I specify when running it?

Best regards.

Shuang

tcpan commented 5 years ago

Hi, Shuang, Let's get you compiling first.

Try adding "-DBUILD_EXAMPLE_APPLICATIONS=ON" to your cmake command. This is the command-line way of changing cmake parameters. Next, you can either run "make" to build everything, or list the available targets with

cmake --build . --target help

then pick the target binary you want and run

make {targetname}

Once you have it running, you can invoke a binary with "--help" to see a list of its parameters, and the corresponding explanations. You probably will have some questions - please feel free to contact me.

The choice of k value depends on what you're trying to do. For computational benchmarking, k <= 32 has the advantage of being long enough to have some biological relevance while being short enough to fit in a machine word (with 2-bit encoding, 32 bases occupy 64 bits). For real assembly of a human genome, however, larger k works better for resolving repeat regions. For example, HipMer uses 55 for human and SPAdes goes up to 77 in its default settings.

As I mentioned previously, our minimizer is not ready for use, and I am considering deprecating it completely. Please do not use it for genome assembly or performance benchmarking.

Thanks, and let me know what other questions you may have.
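
As a sketch, the command-line route above could be wrapped as follows. build_target is a hypothetical helper name (not part of bruno), and any target name passed to it should first be checked against the output of "cmake --build . --target help".

```shell
# Sketch only: build_target is a hypothetical helper, not part of bruno.
# Run it from the build directory; ".." is assumed to be the bruno source tree.
build_target() {
    # Enable the example applications without ccmake, then build one target.
    cmake -DBUILD_EXAMPLE_APPLICATIONS=ON .. &&
    make "$1"
}
```

For example, "build_target compact_debruijn_graph_fastq_A4_K31_freq" would build a single binary, assuming a target with that name exists (the name is inferred from the naming convention earlier in the thread). The resulting binary can then be run with "--help" to list its parameters.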

ShuangQiuac commented 5 years ago

Hi, Tony,

Thanks for your reply! I can build the program now. Can you please specify what I should modify in the CMakeLists.txt so that I can build a binary with K=29 and FASTA as the input file format?

Best regards.

Shuang

tcpan commented 5 years ago

Hi, Shuang, Good to hear.

If you do not need bubble and deadend removal, then edit test/test/CMakeLists.txt as follows:

  1. add "29" to line 164.
  2. uncomment lines 183-186 for the "freq" version of FASTA.
  3. uncomment lines 195-198 for the "incr" version of FASTA.

If you need bubble and deadend removal, then:

  1. add "29" to line 209.
  2. duplicate lines 218-221 and change all occurrences of "fastq" and "FASTQ" to "fasta" and "FASTA" in the duplicates.
  3. for the "incr" version, duplicate lines 228-231 and change all occurrences of "fastq" and "FASTQ" to "fasta" and "FASTA" in the duplicates.

Then rerun cmake in your build directory, and compile the K29 versions of the targets.

That should be it. Please let me know if you run into any issues. Thanks!

Tony
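
After editing the CMakeLists.txt and re-running cmake, one way to confirm that the new targets were generated is to filter the target list by k value. This is a sketch under assumptions: list_k_targets is a hypothetical helper (not part of bruno), and it relies on the K<k> label appearing in target names as in the examples earlier in the thread.

```shell
# Sketch only: list_k_targets is a hypothetical helper, not part of bruno.
# Run it from the build directory after re-running cmake.
list_k_targets() {
    # Keep only help lines whose target name contains the label K<k>.
    cmake --build . --target help | grep -E "_K$1(_|\$)"
}
```

Calling "list_k_targets 29" should then show the K29 targets, if any were generated, and each listed name can be passed to make.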

ShuangQiuac commented 5 years ago

Hi, Tony,

Thanks for your reply! Then how can I specify the input file? I didn’t see any parameter specification when I ran the compiled binary.

Best regards.

Shuang

在 2019年5月2日,上午1:20,Tony Pan notifications@github.com<mailto:notifications@github.com> 写道:

Hi, Shuang, Good to hear.

If you do not need bubble and deadend removal, then edit test/test/CMakeLists.txt,

  1. add "29" to line 164.
  2. uncomment lines 183-186 for the "freq" version of FASTA
  3. uncomment lines 195-198 for the "incr" version of FASTA

If you need bubble and deadend removal, then

  1. add "29" to line 209
  2. duplicate lines 218-221 and change all occurrences of "fastq" and "FASTQ" to "fasta" and "FASTA" in the duplicates.
  3. for the "incr" verison, duplicate lines 228-231, and change all occurrences of "fastq" and "FASTQ" to "fasta" and "FASTA" in the duplicates.

Then rerun cmake in your build directory, and compile the K29 versions of the targets.

That should be it. Please let me know if you run into any issues. Thanks!

Tony

On Wed, May 1, 2019 at 10:48 AM Shuang Qiu notifications@github.com<mailto:notifications@github.com> wrote:

Hi, Tony,

Thanks for your reply! I can build the program now. Can you please specify what I should modify in the CMakeLists.txt, so that I can build a binary with K=29 and input file format = fasta?

Best regards.

Shuang

在 2019年5月1日,下午10:11,Tony Pan notifications@github.com<mailto:notifications@github.com<mailto: notifications@github.commailto:notifications@github.com>> 写道:

Hi, Shuang, Let's get you compiling first.

Try adding "-DBUILD_EXAMPLE_APPLICATIONS=ON" to your cmake command. This is the commandline way of changing cmake parameters. Next you can either do "make" to build everything or use the following to build specific binaries:

cmake --build . --target help

and pick the target binary you want, and run

make {targetname}

Once you have it running, you can invoke a binary with "--help" to see a list of its parameters, and the corresponding explanations. You probably will have some questions - please feel free to contact me.

The choice of k value depends on what you're trying to do. For computational benchmarking, k <= 32 has the advantage of being long enough to have some biological relevance while short enough to fit in a machine word. For real assembly of human genome, however, larger k works better for resolving repeat regions, for example Hipmer uses 55 for human and SPADES goes up to 77 in their default settings.

As I mentioned previously, our minimizer is not ready for use, and I am considering deprecating it completely. Please do not use it for genome assembly or performance benchmarking.

Thanks, and let me know what other questions you may have.

On Wed, May 1, 2019 at 1:04 AM Shuang Qiu notifications@github.com<mailto:notifications@github.com mailto:notifications@github.com> wrote:

Dear Tony,

Thanks for your instruction! Unfortunately I can not use ccmake under CentOS in our lab servers. Could you please provide other instructions on how to use it with only cmake and make? For example, if I want to run bruno on dataset human chromosome 14, how can I build the program, and what parameters, e.g. kmer length, minimizer length, should I specify running it?

Best regards.

Shuang

在 2019年4月18日,下午11:02,Tony Pan notifications@github.com<mailto:notifications@github.com<mailto: notifications@github.commailto:notifications@github.com><mailto: notifications@github.commailto:notifications@github.commailto:notifications@github.com>> 写道:

Hi, Shuang, It looks like I forgot to turn on the "Build Example Applications" by default.

Can you try to use ccmake instead of cmake

ccmake src_dir

which will present a graphical user interface for configuring the project. The first item should be BUILD_EXAMPLE_APPLICATIONS. change that to "ON". then press c to configure and g to "generate and exit".

Now this is going to create a somewhat large number of targets (what I needed during my evaluation). You can see a list of the build target by

cmake --build . --target help

At this point if you run "make" (or "make -j 4"), everything will be built and it might take a while, or you can run "make {target}", where {target} is one of the targets listed. All the targets came about due to c++ templating and a desire to reduce individual binary size and to avoid excessive branching in the code.

Now some explanation of the target naming conventions. Here are a couple of example targets:

"compact_debruijn_graph_fastq_A4_K21_freq_clean_recompact_incr"
"compact_debruijn_graph_fastq_A4_K31_freq_minimizer"

fastq: means it operates on fastq files. We can easily support fasta files as well - I'll explain how in a little bit.
A4: standard 2-bit DNA encoding. The alternative is A16, which supports 4-bit DNA encoding (IUPAC).
K21: k-mer length. The cmake script is currently configured with 21, 31, 51, 55, and 63. We can easily support others as well.
freq: this is my "code name" for an optimized graph construction algorithm. You should by default choose binaries with this label.
clean and clean_recompact: bubbles and dead ends are removed and chains are recompacted. I used some simple criteria for identifying bubbles and dead ends, and they may not be what you want. The code is set up so that an application developer can define their own criteria, but this requires some C++ coding.
minimizer: an attempt at using minimizers for data distribution across multiple nodes - not performing well yet. You should avoid these.
incr: for when the input files push the memory limit. This is data dependent (number of unique k-mers), but you may want to try these incremental versions if you have multiple files in your dataset and the fastq files are more than 1/16 of the total memory (a guess).
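
The naming convention above can be read mechanically. The following is a small sketch that splits a target name on underscores and prints what each label means according to the descriptions in this message; `describe_target` is a hypothetical helper written for illustration, not something shipped with bruno, and it assumes bash.

```shell
# Explain each label in a bruno example target name.
describe_target() {
  local name=${1#compact_debruijn_graph_} label
  local IFS=_
  set -- $name   # intentionally unquoted: split on underscores
  for label in "$@"; do
    case $label in
      fastq|fasta) echo "input format: $label" ;;
      A4)          echo "alphabet: 2-bit DNA encoding" ;;
      A16)         echo "alphabet: 4-bit DNA encoding (IUPAC)" ;;
      K[0-9]*)     echo "k-mer length: ${label#K}" ;;
      freq)        echo "optimized graph construction (recommended)" ;;
      clean)       echo "bubble and dead-end removal" ;;
      recompact)   echo "chains recompacted after cleaning" ;;
      minimizer)   echo "minimizer-based distribution (avoid)" ;;
      incr)        echo "incremental version for large inputs" ;;
      *)           echo "unrecognized label: $label" ;;
    esac
  done
}

describe_target compact_debruijn_graph_fastq_A4_K21_freq_clean_recompact_incr
```

Piping the output of `cmake --build . --target help` through a helper like this is one way to scan the long target list for the combination you need.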

To support FASTA files and other k values, we just need to change the CMakeLists.txt file to generate the appropriate targets. I can show you how to do that.

To summarize quickly, use the versions with "A4" and "freq" labels. If you think you'll run out of memory, try the "incr" version. If you need fasta file or other k-values support, let me know. If you need to remove bubbles and dead ends, we should talk.
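
The 1/16 rule of thumb for choosing the "incr" versions can be turned into a quick check. This is only a sketch of that guess: the helper names `total_bytes` and `suggest_incr` are hypothetical, total memory is passed in by the caller rather than detected, and the threshold is the rough estimate given in this message, not a measured limit.

```shell
# Sum the sizes in bytes of the given input files.
total_bytes() {
  local sum=0 f
  for f in "$@"; do
    sum=$(( sum + $(wc -c < "$f") ))
  done
  echo "$sum"
}

# Succeeds (exit 0) when the input is larger than 1/16 of total memory,
# i.e. when an "incr" target may be worth trying.
suggest_incr() {
  local input_bytes=$1 mem_bytes=$2
  [ "$input_bytes" -gt $(( mem_bytes / 16 )) ]
}
```

For example, on a node with 64 GiB of memory (68719476736 bytes) the threshold works out to 4 GiB of input, so `suggest_incr "$(total_bytes *.fastq)" 68719476736` succeeds once the fastq files together exceed that size.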

Making the configure process easier has been on my to-do list for a while. I'll try to find some time to work on this.

Thanks


tcpan commented 5 years ago

Hi, Shuang,

Can you verify that when you call the binary you get at least something like this?

EXECUTING bin/compact_debruijn_graph_fastq_A4_K31_freq_clean_recompact
PARSE ERROR: Required argument missing: filenames

Brief USAGE: bin/compact_debruijn_graph_fastq_A4_K31_freq_clean_recompact [-M] [-C] [-N] [-R] [-B] [-U ] ... [-L ] ... [-T] [-O ] [--] [--version] [-h] ...

For complete USAGE and HELP type: bin/compact_debruijn_graph_fastq_A4_K31_freq_clean_recompact --help

You can add the "--help" switch to see the full parameter list. Let me know if you have any questions. The choice of switches will depend on what your goals are: benchmarking, generating and writing out the contigs, with or without bubble and dead-end cleaning, etc. You can list all fasta files at the end of the command.

I also want to re-emphasize that the bubble and dead-end cleaning is meant as a demonstration of the library's capability and is based on my definition of bubbles and dead ends. If you need graph cleaning, we should talk to make sure your desired logic is implemented.

Thanks!

Tony

On Fri, May 3, 2019 at 12:49 AM Shuang Qiu wrote:

> Hi, Tony,
>
> Thanks for your reply! Then how can I specify the input file? I didn't see any parameter specification when I ran the compiled binary.
>
> Best regards.
>
> Shuang
>
> On May 2, 2019, at 1:20 AM, Tony Pan <notifications@github.com> wrote:
>
> Hi, Shuang,
> Good to hear.
>
> If you do not need bubble and dead-end removal, then edit test/test/CMakeLists.txt,
>
> 1. add "29" to line 164.
> 2. uncomment lines 183-186 for the "freq" version of FASTA
> 3. uncomment lines 195-198 for the "incr" version of FASTA
>
> If you need bubble and dead-end removal, then
>
> 1. add "29" to line 209
> 2. duplicate lines 218-221 and change all occurrences of "fastq" and "FASTQ" to "fasta" and "FASTA" in the duplicates.
> 3. for the "incr" version, duplicate lines 228-231, and change all occurrences of "fastq" and "FASTQ" to "fasta" and "FASTA" in the duplicates.
>
> Then rerun cmake in your build directory, and compile the K29 versions of the targets.
>
> That should be it. Please let me know if you run into any issues. Thanks!
>
> Tony
>
> On Wed, May 1, 2019 at 10:48 AM Shuang Qiu wrote:
>
> > Hi, Tony,
> >
> > Thanks for your reply! I can build the program now. Can you please specify what I should modify in the CMakeLists.txt, so that I can build a binary with K=29 and input file format = fasta?
> >
> > Best regards.
> >
> > Shuang
> >
> > On May 1, 2019, at 10:11 PM, Tony Pan <notifications@github.com> wrote:
> >
> > Hi, Shuang,
> > Let's get you compiling first.
> >
> > Try adding "-DBUILD_EXAMPLE_APPLICATIONS=ON" to your cmake command. This is the command-line way of changing cmake parameters. Next you can either do "make" to build everything, or list the available targets with
> >
> > cmake --build . --target help
> >
> > and pick the target binary you want, and run
> >
> > make {targetname}
> >
> > Once you have it running, you can invoke a binary with "--help" to see a list of its parameters, and the corresponding explanations. You probably will have some questions - please feel free to contact me.

ShuangQiuac commented 5 years ago

Hi, Tony,

Thanks for your reply! When I call the binary, it does not ask for any input parameters. It just runs and prints the following output:

READING /test/data/test.debruijn.small.fastq via posix
total size read is 939
PARSING and INSERT
rank 0 BEFFORE input=210 size=0 buckets=512
rank 0 AFTER input=210 size=140 reported=140 buckets=512
PARSING and INSERT DONE: total size after insert/rehash is 140
HISTOGRAM
TOTAL Edge Existence Histogram:
  0 1 2 3 4
0 0 5 0 0 0
1 1 132 0 1 0
2 0 0 0 0 0
3 0 1 0 0 0
4 0 0 0 0 0
rank 0 finished checking index
PRINT BRANCHES
PRINT BRANCH KMERS
SIZES simple biedge size: 24 kmer size 8 node size 32
MAKE CHAINMAP
MARK TERMINI NEXT TO BRANCHES
estimate available mem=124400472064 bytes, p=1, alloc 124400472064 elements
estimate num chain terminal updates=140, value_type size=24 bytes
LIST RANKING
rank 0 iter 1 updated 264, unfinished 137 internal chain nodes 117
rank 0 iter 2 updated 260, unfinished 137 internal chain nodes 97
rank 0 iter 3 updated 248, unfinished 137 internal chain nodes 57
rank 0 iter 4 updated 214, unfinished 114 internal chain nodes 0
REMOVE ISOLATED
REMOVED 0 isolated nodes
rank 0/1 input is EMPTY.
rank 0 BEFORE input=210 size=0 buckets=512
rank 0 AFTER input=210 size=140 buckets=512
rank 0 map_base get_multiplicity called
rank 0 BEFORE input=138 size=0 buckets=512
rank 0 AFTER input=138 size=6 buckets=512
PRINT CHAIN String
PRINT CHAIN Nodes
COMPUTE CHAIN FREQ SUMMARY
GATHER NON_REP_END EDGE FREQUENCY
rank 0 result size 6 capacity 7
CREATE CHAIN EDGE FREQUENCIES
PRINT CHAIN EDGE FREQS

Best regards.

Shuang

tcpan commented 5 years ago

Hi, Shuang, Sorry for the late reply - was on vacation.

Can you verify the exact command you used? And also the git commit number you used?

What do you get when you call "bin/compact_debruijn_graph_fastq_A4_K31_freq --help"?

The binaries are not interactive. The parameters need to be specified at the commandline as parameters to the binary. When you call the binary without any parameters, it directly starts execution using the default parameters, which uses the included test data - essentially it becomes an integration test run.
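
Because a binary called without arguments silently runs on the bundled test data, a thin wrapper can catch that case before anything executes. This is a sketch only: `run_graph` is a hypothetical helper, the input file names in the usage note are placeholders, and no bruno switches are assumed beyond what `--help` reports.

```shell
# Run a bruno binary on explicit input files, refusing to fall back
# to the default test data when no files are given.
run_graph() {
  local bin=$1 f
  shift
  [ -x "$bin" ] || { echo "binary not found: $bin" >&2; return 1; }
  [ "$#" -gt 0 ] || { echo "no input files given; the binary would use its bundled test data" >&2; return 2; }
  for f in "$@"; do
    [ -r "$f" ] || { echo "cannot read input file: $f" >&2; return 3; }
  done
  # Input files go at the end of the command line, as described in this thread.
  "$bin" "$@"
}
```

A call might look like `run_graph bin/compact_debruijn_graph_fastq_A4_K31_freq reads_1.fastq reads_2.fastq`; any switches listed by `--help` would need to be added by hand.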

Tony

— You are receiving this because you commented. Reply to this email directly, view it on GitHub , or mute the thread .
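The target naming convention quoted above can be read mechanically. The following is a minimal sketch, assuming only the labels described in this thread; the function name `describe_target` and its output format are made up for illustration and are not part of bruno. The `fasta` label is included on the assumption that a FASTA target would be named analogously.

```shell
#!/bin/sh
# Decode a bruno example-target name into labelled parts.
# Label meanings follow the explanation in this thread; the function
# name and the key=value output format are hypothetical.
describe_target() {
    # Split the name on underscores into positional parameters.
    old_ifs=$IFS
    IFS=_
    set -- $1
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            fastq|fasta) echo "format=$part" ;;
            A4)          echo "alphabet=2-bit DNA" ;;
            A16)         echo "alphabet=4-bit DNA (IUPAC)" ;;
            K[0-9]*)     echo "k=${part#K}" ;;
            freq)        echo "construction=optimized" ;;
            clean)       echo "cleaning=bubbles and dead ends removed" ;;
            recompact)   echo "recompact=yes" ;;
            minimizer)   echo "minimizer=yes (avoid)" ;;
            incr)        echo "incremental=yes" ;;
        esac
    done
}

describe_target compact_debruijn_graph_fastq_A4_K21_freq_clean_recompact_incr
```

Parts of the name that carry no option, such as `compact`, `debruijn`, and `graph`, are skipped silently.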
ShuangQiuac commented 5 years ago

Hi, Tony,

Thanks a lot for your reply.

Sorry, I previously missed the last parameter for inputting a file, as listed with --help (but the program does not ask for the input parameter if I directly execute ./compact_debruijn_graph_fastq_A4_K31; it directly reads the test data).

I have another question: how do I specify the number of CPU threads/processes to run it with? It seems that only one CPU core is used by default when I execute ./compact_debruijn_graph_fastq_A4_K31 <input fastq file>

The git commit number is db82d6fd1f0850dc5e5a70cc978619954d714ce5.

Best regards.

Shuang


tcpan commented 5 years ago

Hi, Shuang, Yeah, I should have mentioned this part about multi-core. The binary is an MPI program. What you need to do is use one of the MPI flavors: OpenMPI, MPICH, MVAPICH, or, on Cray systems, Cray MPI. Typically, there is an mpirun or mpiexec command that you'd prefix the binary with. You'd also specify the number of cores/processes as a parameter to the mpirun/mpiexec command. For example, for OpenMPI, the process count is specified by "-np", so the command line might look like

mpirun -np 16 ./compact_debruijn_graph_fastq_A4_K31

Without mpirun, the bruno binaries essentially run single-threaded.

Unfortunately, each MPI implementation may name its launch command differently and use a different set of flags. Furthermore, to have it run well, the MPI processes should be pinned to the cores. Each MPI implementation again has its own way of doing this. Finally, if you are using a job scheduler, that will also affect how the processes are assigned to cores.
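To illustrate how the launch line differs between setups, here is a rough sketch that only prints a command rather than running it. The flavor labels (`openmpi`, `mpich`, `slurm`), the function name `launch_cmd`, and the input file name `reads.fastq` are invented for this example. The binding flags shown are the ones commonly documented for recent Open MPI, MPICH (Hydra), and Slurm releases; older versions may spell them differently, so check the man page on your system.

```shell
#!/bin/sh
# Print (do not run) a launch command with one process bound per core.
# The flavor labels are hypothetical names used only in this sketch.
launch_cmd() {
    flavor=$1
    nprocs=$2
    shift 2
    case $flavor in
        # Open MPI: process count via -np, binding via --bind-to
        openmpi) echo "mpirun -np $nprocs --bind-to core $*" ;;
        # MPICH with the Hydra launcher: -n and -bind-to
        mpich)   echo "mpiexec -n $nprocs -bind-to core $*" ;;
        # Slurm launching the MPI job directly with srun
        slurm)   echo "srun -n $nprocs --cpu-bind=cores $*" ;;
        *)       echo "unknown flavor: $flavor" >&2; return 1 ;;
    esac
}

# reads.fastq is a placeholder input file name.
launch_cmd openmpi 16 ./compact_debruijn_graph_fastq_A4_K31 reads.fastq
```

Running `mpirun --version` (or `mpiexec --version`) usually reveals which implementation is installed, which tells you which branch applies.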

Since you were able to compile, I assume you have an MPI installation on your system. If you can let me know which MPI you are using, and which job scheduler, if any, I'd be better able to tell you which switches are needed.

Tony
